Data quality, a team sport

By now, most organizations execute both an offensive and a defensive data strategy. The offensive strategy aims to meet revenue and market share objectives; the defensive one minimizes risk through controls (e.g. who has access to what data, and for what purpose).

O’Reilly recently conducted a survey that revealed one of the most pressing problems data teams face: a lack of high-quality data. According to the study, the issue is being elevated to the CxO suite. It spans organizational borders and has become an increasingly critical blocker to achieving business objectives.

When drilling down into what bad data quality means, the survey respondents (both business and technical stakeholders) identified the lack of organization (accountability) and of information about data (metadata) as a primary problem. This is because data consumers often don't know who maintains the data and its pipelines, how and where it is sourced, and so on. Respondents also identified the lack of controls as a primary problem, as well as the lack of people and time to (proactively) resolve issues.

The respondents also uncovered the vast variety of problems data teams face. This is partly because of the variety of datasets (e.g. external vs. internal, batch vs. real-time, system- vs. human-generated), and partly because the context in which the data originated is often different from the context in which the data will be used (see the concept of data gravity).

Let’s take a look at two examples:

Product events dataset. A dataset created by data product managers and engineers, collecting user event data across devices. It’s used to better understand and respond to customer actions (e.g. email a discount code when a user abandons their shopping cart). Common data quality issues in event data are missing data in certain segments (e.g. Android), inconsistent event properties, and poorly communicated schema changes.
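As a concrete illustration, here is a minimal sketch in Python (using pandas) of the kind of checks that catch these issues early. The column names, segments, and expected schema are assumptions for the sake of the example.

```python
import pandas as pd

# Hypothetical events extract; in practice this would come from your
# warehouse or event stream.
events = pd.DataFrame({
    "event_name": ["cart_abandoned", "cart_abandoned", "checkout"],
    "platform": ["ios", "web", "ios"],
    "user_id": [1, 2, 3],
})

# Check 1: every segment we expect to see is actually present.
EXPECTED_PLATFORMS = {"ios", "android", "web"}
missing_platforms = EXPECTED_PLATFORMS - set(events["platform"].unique())
if missing_platforms:
    print(f"Missing event data for segments: {missing_platforms}")

# Check 2: the schema matches what downstream consumers expect,
# so unannounced schema changes surface before they break pipelines.
EXPECTED_COLUMNS = {"event_name", "platform", "user_id", "timestamp"}
schema_drift = EXPECTED_COLUMNS.symmetric_difference(events.columns)
if schema_drift:
    print(f"Schema drift detected on columns: {schema_drift}")
```

Run against this sample, the sketch flags the missing Android segment and the absent timestamp column, the kind of silent gaps that would otherwise only surface much further downstream.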

Customer dataset. This dataset contains dimensions and metrics about our customers. For many organizations it is a difficult dataset, as it is produced by many different teams across the organization. Alongside the system-generated data, many organizations also capture human customer interactions and therefore rely on people to maintain the data. Common data quality issues here are incomplete data (lack of enforcement in human processes) and data processing issues (mistakes in, or evolution of, the business logic in data pipelines).
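A completeness check for a dataset like this can be equally simple. The sketch below assumes a customer table with illustrative `email` and `lifetime_value` columns, and a missing-value threshold agreed with the producing team.

```python
import pandas as pd

# Hypothetical customer extract; column names are illustrative.
customers = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "email": ["a@example.com", None, "c@example.com", None],
    "lifetime_value": [120.0, 45.5, None, 300.0],
})

# Completeness: what share of records is missing required fields?
# Human-maintained fields (like email) are where enforcement gaps show up.
for column in ["email", "lifetime_value"]:
    missing_pct = customers[column].isna().mean() * 100
    print(f"{column}: {missing_pct:.1f}% missing")
    if missing_pct > 5.0:  # threshold agreed with the data producer
        print(f"  -> completeness check failed for {column}")
```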

Data quality issues are unfortunately often silent for days or weeks on end until someone spots them. The worst issues start all the way upstream and pollute your data lake, data warehouse, and data products until they are ultimately spotted by your customers, who end up doing the QA of your data system for you. As organizations’ data infrastructure grows more complex (more pipelines and products), you need transparency on what’s impacted (data lineage).
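When an upstream issue does slip through, even a simple lineage graph answers the first question responders ask: what is impacted downstream? Below is a minimal sketch in Python; the dataset names and dependencies are invented for illustration.

```python
# Toy lineage graph: each dataset maps to its direct downstream consumers.
LINEAGE = {
    "raw_events": ["events_cleaned"],
    "events_cleaned": ["sessions", "cart_abandonment"],
    "sessions": ["customer_360"],
    "cart_abandonment": ["email_campaigns"],
    "customer_360": [],
    "email_campaigns": [],
}

def impacted(dataset: str) -> set:
    """Return every dataset downstream of the one with an issue."""
    result = set()
    stack = list(LINEAGE.get(dataset, []))
    while stack:
        node = stack.pop()
        if node not in result:
            result.add(node)
            stack.extend(LINEAGE.get(node, []))
    return result

# A quality issue in raw_events pollutes everything below it:
print(impacted("raw_events"))
# {'events_cleaned', 'sessions', 'cart_abandonment', 'customer_360', 'email_campaigns'}
```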

To avoid these scenarios and develop the ability to respond quickly, we recommend that data teams do the following:

1. Document who the owners and subject matter experts are for your key datasets and sources. Proactively establish a relationship with these data producers.
2. Measure key steps in your data value chain, starting at the end, right before data consumption. Monitoring these measurements creates transparency and makes issues actionable (see the sketch after this list).
3. Provide data producers with a dashboard that shows them how well they are meeting the expectations of a given data product.
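As promised above, here is a minimal sketch of recommendation 2: measure right before consumption and compare against expectations agreed with the consumers of the data product. The metrics and thresholds are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical measurements taken right before data consumption.
# In practice these would come from a query against the serving table.
row_count = 48_250
last_loaded_at = datetime.now(timezone.utc) - timedelta(hours=30)

# Expectations agreed with the consumers of the data product.
MIN_ROWS = 50_000            # typical daily volume, minus tolerance
MAX_STALENESS = timedelta(hours=24)

checks = {
    "volume": row_count >= MIN_ROWS,
    "freshness": datetime.now(timezone.utc) - last_loaded_at <= MAX_STALENESS,
}

for name, passed in checks.items():
    print(f"{name}: {'pass' if passed else 'FAIL'}")
# Failed checks are what you alert on, and what feeds the producer dashboard
# from recommendation 3.
```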

This may seem like a daunting task, but our experience has proven that, when driven by leaders in data & analytics and supported by the right set of tools, data quality management becomes an important catalyst for change, helping you create a strong data culture of ownership and accountability.

Soda helps companies on this journey. Contact us if you want to learn more!