Have you ever had unexpected or bad data flow into one of your analytical products? Bad data leads to compromised analytical integrity, faulty decision making, and a loss of trust in your data.
As organizations do more with data, the need for high-quality data only increases. In machine learning, for example, it is crucial to first find an unbiased, clean subset to train your predictive model on. Bad data leads to faulty predictions. This is why companies are warming up to the idea of testing their data.
Testing is not a new concept. In software engineering, test-driven development (TDD) has been common practice for many years. Automated testing goes hand in hand with writing software code. If you are not testing your code automatically after every commit, you would be considered an amateur rather than a professional. In order to trust software, you need to have automated test suites. This is no different when it comes to data that’s used in production.
Over the last few years, data product development has made big progress. Mainstream adoption of new data technologies has become a reality. Even for machine learning, one of the frontiers of software development, many off-the-shelf solutions can readily be found online.
Alongside ML, traditional data warehousing has developed just as much. Companies and consumers started to produce a lot more data, so a new type of cloud-native data warehouse (like Snowflake) came to market, where storage and compute are separated (read: fast and scalable) and pricing is based on compute power and uptime (read: low cost).
Before we continue, let’s first define what we mean by “data in production” and introduce the concept of a “data product”. A data product is similar to other software products (e.g. it has an interface and gets new features that are deployed to dev and production), but it differs in that its primary goal is to provide value through analytical datasets. Data products therefore rely heavily on data to drive outcomes. Another characteristic is that the most recent data (how many traffic jams are there right now?) is often the most important.
When developing a data product, it makes sense to start testing data as soon as it goes into production. Examples could be testing the daily set of transactions that came in, or the distribution of predictions we’ve made across all segments. It’s vital that the data is clean and as expected, to ensure the integrity of your data product.
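As a minimal sketch of what testing a daily set of transactions could look like, the Python snippet below runs a few basic checks on one day's batch. The field names (transaction_id, amount, day) and the function check_daily_transactions are hypothetical and chosen only for illustration; real checks would depend on your own schema and business rules.

```python
from datetime import date


def check_daily_transactions(rows, expected_day, min_rows=1):
    """Return a list of readable failures; an empty list means the batch passed."""
    failures = []

    # Volume check: an empty or tiny batch often means an upstream load failed.
    if len(rows) < min_rows:
        failures.append(f"expected at least {min_rows} rows, got {len(rows)}")

    # Uniqueness check on the (hypothetical) primary key.
    ids = [row.get("transaction_id") for row in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate transaction_id values")

    for index, row in enumerate(rows):
        # Completeness check: every transaction needs an amount.
        if row.get("amount") is None:
            failures.append(f"row {index}: missing amount")
        # Freshness check: the batch should only contain the expected day.
        if row.get("day") != expected_day:
            failures.append(f"row {index}: day {row.get('day')} is not {expected_day}")

    return failures


batch = [
    {"transaction_id": 1, "amount": 10.0, "day": date(2024, 1, 2)},
    {"transaction_id": 2, "amount": None, "day": date(2024, 1, 2)},
]
for failure in check_daily_transactions(batch, expected_day=date(2024, 1, 2)):
    print(failure)
```

Returning a list of failures instead of stopping at the first one makes it easier to report everything that is wrong with a batch in a single run.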
Why should you test data products once they are put into production?
In software development, developers test because it is necessary to spot bugs and their underlying causes early, as they tend to lead to crashes. Bugs that crash your systems clearly aren’t a good thing, but at least they signal that something is wrong, which makes them relatively easy to detect.
In data product development, testing is even more crucial, for several reasons. Many factors force data to change constantly. The people who cause these changes are frequently in different parts of the organisation, and they don’t always assess the impact on the data value chain when making changes to the operational systems. Furthermore, data issues are often silent. A machine learning model or algorithm, for example, will continue to work even if some of its inputs are off. It can take weeks, months or even years to spot an issue, so it’s crucial that you detect anomalies as soon as possible. Keep in mind, too, that data lakes and new data warehouses can become very complex, which means manual testing alone is no longer enough to prevent data lakes from becoming data swamps.
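To illustrate how a silent issue might be surfaced automatically, here is a rough Python sketch that compares the mean of today's values for one model input against a historical baseline. The function name mean_shift_alert and the threshold of 3 are assumptions made for this example, and the check itself is a crude heuristic on the mean, not a rigorous statistical test.

```python
from statistics import mean, stdev


def mean_shift_alert(baseline, current, max_z=3.0):
    """Return True when the current mean drifts far from the baseline mean.

    Distance is measured in baseline standard errors, so larger current
    samples need a smaller absolute shift to trigger the alert.
    """
    base_mean = mean(baseline)
    base_sd = stdev(baseline)  # needs at least two baseline values

    # A constant baseline has no spread, so any change in the mean is a shift.
    if base_sd == 0:
        return mean(current) != base_mean

    standard_error = base_sd / len(current) ** 0.5
    z_score = abs(mean(current) - base_mean) / standard_error
    return z_score > max_z


history = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
today = [20, 21, 19, 22]
if mean_shift_alert(history, today):
    print("input looks different from the baseline, investigate before trusting predictions")
```

A model fed with today's values would still return predictions without any error, which is exactly why a check like this has to run alongside it.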
When should you start testing your data?
The answer here is short and sweet: as soon as possible. This means introducing a culture change in your teams that starts with creating transparency about errors and data deficiencies. We recommend testing your first data product after every build or change you make (even if your product is a report!). Data products that are not monitored will break, and knowing why they break is crucial for further data product development. On a more technical level, it’s recommended to test after each step in the pipeline, as well as in between the different pipelines.
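Testing after each step in the pipeline could be sketched in Python as follows. Every name here (run_pipeline, the two transforms and their checks) is hypothetical; the point is only that each transformation is paired with a check that runs immediately afterwards, so a failure points to the step that caused it.

```python
def run_pipeline(data, steps):
    """Run (name, transform, check) steps in order and stop at the first failed check."""
    for name, transform, check in steps:
        data = transform(data)
        problems = check(data)
        if problems:
            raise ValueError(f"step '{name}' failed: {problems}")
    return data


def drop_null_amounts(rows):
    return [row for row in rows if row["amount"] is not None]


def no_null_amounts(rows):
    return ["null amount found"] if any(row["amount"] is None for row in rows) else []


def to_cents(rows):
    return [{**row, "amount_cents": round(row["amount"] * 100)} for row in rows]


def cents_non_negative(rows):
    return [
        f"negative amount_cents: {row['amount_cents']}"
        for row in rows
        if row["amount_cents"] < 0
    ]


STEPS = [
    ("clean", drop_null_amounts, no_null_amounts),
    ("convert", to_cents, cents_non_negative),
]

result = run_pipeline([{"amount": 1.5}, {"amount": None}], STEPS)
print(result)
```

The same pattern can be applied between pipelines by treating the output check of one pipeline as the input check of the next.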
Who should be involved in data quality testing?
Historically, testing has been the job of the engineering team, and because of their rather limited data domain knowledge, they focussed on operational metrics. Experience has shown that testing data without domain knowledge is quite useless, as many high-impact issues will stay hidden in the dataset. We therefore recommend always including data subject matter experts (SMEs) in the testing of the source data that goes into your data products.
Data product development teams can and should apply the best practices that have been developed in software engineering and data management over the last decades. Most data products won’t even break when the data goes bad, so issues go unnoticed for a while (silent errors), resulting in poor product quality and days or weeks of clean-up work. Nobody wants to be a data janitor, so start testing your data products today!