Automating data quality tests with assertions
Introduction
One of the key assets of the data team is having data which can be trusted. If the data you share can’t be trusted, it won’t be used to make decisions, and your organization will be left to base their decisions on guesswork and intuition.
Modern data teams use automated data quality tests to check the validity of data they provide to their organization before it is used for analytics and decision making.
Creating a single source of truth
As we discussed in a previous chapter, one of the core responsibilities of the data team is creating a “single source of truth” for your organization’s data. To create a single source of truth, raw data is ingested into the warehouse from many different sources. It’s then the data team’s responsibility to define the transformation logic that brings that source data together into meaningful datasets for downstream reporting and analytics.
Of course, the single source of truth is only useful if the data is correct.
Bad data is everywhere
As anyone who works with data knows, data quality issues are everywhere. There are various points along your data pipeline where they can creep in:
- The applications that generate the data in the first place can have issues
- The ingestion pipeline, moving data from source to your warehouse, might be buggy
- The transformation code written by the data team may contain errors, or become out of date
Any of these issues can lead to incorrect data in the reports, dashboards or analytics which the data team produces. This can lead to poor decisions, and will ultimately erode the trust the organization has in the data team.
The data quality firefighting trap
Data quality issues constantly appear, and fixing them is important. Naturally, every time the team is alerted to an issue, resolving it becomes a high priority. Unfortunately, resolving the problem can be time consuming:
- First, someone needs to work out where in your data stack the issue was introduced
- Next, they need to find a way to resolve the problem.
- Finally, the impact of the issue needs to be understood and communicated to impacted teams
If you don’t get on top of data quality early, this firefighting process can take up a significant chunk of your team’s time. Instead, you need to find a way to build data quality into your processes and workflows.
What to do about bad data?
Clearly, it’s important that you find a way to stay on top of data quality. Thankfully, there’s a solution that allows you to do so while staying agile: data tests.
For decades, software engineers have been writing tests for their code. Now, data teams are taking inspiration from them. In software engineering, whenever new functionality is added to a piece of software, the engineer also adds tests. These tests assert that the functionality does what it’s supposed to do. More importantly, the tests prevent unexpected regressions in behaviour caused by later updates to the code. If any tests fail, the engineer knows something is wrong!
In a similar way, modern data teams can use data tests to bring stability to their “codebase”.
What is a data test?
There are two types of tests that can be useful when working with data: assertions and unit tests.
Assertions: assertions are runtime checks which verify that real data meets some set of criteria. For example:
- Revenue should never be negative
- A user_id field should never be NULL
- A user can’t have a signup_date before the company was founded (see the sketch after this list)
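As an illustration, the last rule above can be written as a query that selects any offending rows. This is only a sketch: the users table, its signup_date column (assumed to be a DATE), and the founding date are all assumptions made for the example.
-- Hypothetical example: flag users whose signup_date predates the company's founding.
-- The users table, signup_date column and founding date are illustrative assumptions.
SELECT
  *
FROM
  users
WHERE
  signup_date < DATE '2015-01-01'
If this query returns any rows, the assertion fails.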
Unit tests: unit tests are used to check whether a particular data transformation does what you’d expect. Unit tests specify some (usually fake) input data, and the expected output data of the transformation. If the two don’t match, something is wrong with the transformation code.
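There is no single standard syntax for data unit tests, but the idea can be sketched in plain SQL: define some fake input inline, apply the transformation logic to it, and compare the result with the expected output. The table and column names below, and the BigQuery-style EXCEPT DISTINCT, are assumptions made for illustration.
-- A minimal unit-test sketch in plain SQL.
-- fake_orders provides hand-written input; "actual" applies the transformation
-- under test; "expected" is the hand-written expected output.
WITH fake_orders AS (
  SELECT 1 AS customer_id, 10 AS order_value UNION ALL
  SELECT 1, 15 UNION ALL
  SELECT 2, 20
),

actual AS (
  -- the transformation under test: total order value per customer
  SELECT
    customer_id,
    SUM(order_value) AS lifetime_value
  FROM
    fake_orders
  GROUP BY
    1
),

expected AS (
  SELECT 1 AS customer_id, 25 AS lifetime_value UNION ALL
  SELECT 2, 20
)

SELECT * FROM actual
EXCEPT DISTINCT
SELECT * FROM expected
Any rows returned indicate that the transformation’s output differs from what was expected (a complete test would also check the reverse direction).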
While unit tests can be useful in some circumstances, assertions are generally a more useful method of testing data quality, and we’ll focus on them for the rest of this article. You can read more about unit testing here.
Creating assertions
The most common way to define assertions over a dataset is to write a query to select incorrect data. This is best demonstrated with an example:
Imagine you are creating a table named customers that contains all the information you have gathered together from various sources about your customers. The table contains a field called customer_id, which you know should never be NULL. If it is NULL, something has gone wrong. In SQL, you would write an assertion for this like so:
SELECT * FROM customers WHERE customer_id IS NULL;
If the query returns 0 rows, your test passed: you can be confident that the table follows the rule that customer_id is never NULL. If any rows are returned, you have some data quality issues to fix.
Automated testing and failure notifications
Data tests alone won’t fix your data quality problems. Once data tests are defined, they need to be executed as part of your development process, and any failing tests should be flagged to your team as soon as they’re found. There are two points at which you can be notified of test failures:
- During development: someone on the team is making updates to an existing transformation. Before those changes go “live”, all relevant tests should be executed. If any fail, there are likely bugs in the new code, and the developer can fix them there and then.
- During pipeline execution: Your pipelines are likely being updated regularly, perhaps as often as every hour. Each time there is an update, there’s an opportunity for some bad data to cause a test to fail. Ensuring your team is notified whenever this happens means you can respond immediately, minimising impact on the business.
What data should you test?
The answer here is pretty simple: all of it! Of course, writing assertions takes time, and data teams are usually resource constrained. However, time spent writing tests now prevents far more time being wasted debugging data quality issues in the future. If you need to prioritise, the following parts of your data pipeline are where you’ll probably see the most return on the time you invest in assertions.
Flaky source data
The foundation of the data model is your source data. Typically this is not owned by the data team: it may come from your operational databases, third party providers like Salesforce or Stripe, or even a spreadsheet owned by your finance department.
If the input data is incorrect, your entire data model will suffer. Spending time writing assertions that validate your data sources means you’ll be alerted as soon as a problem arises, so you can focus on resolving it.
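As a sketch, an assertion over a raw payments table might look like the following. The raw_stripe_payments table and its columns are assumed purely for illustration.
-- Hypothetical source-data assertion over a raw payments table.
-- Table and column names are illustrative assumptions.
SELECT
  *
FROM
  raw_stripe_payments
WHERE
  payment_id IS NULL                    -- every payment should have an identifier
  OR amount_usd < 0                     -- amounts should never be negative
  OR created_at > CURRENT_TIMESTAMP()   -- no payments from the future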
Complex data transformations
Your data model may contain hundreds of transformations, but there will probably only be a handful that are particularly complex. Spend some time writing assertions to make sure they’re doing what you’d expect! Some examples:
- Sessionization: Make sure you don’t have any sessions longer than 10 days (a sketch follows this list).
- Monthly Recurring Revenue: Ensure that MRR is never above a particular threshold (that you know is very unlikely to be true).
- Average Order Value: Should never be negative!
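Taking the sessionization rule as an example, the assertion could look like the following. This is only a sketch: the sessions table, its session_start and session_end timestamp columns, and the BigQuery-style TIMESTAMP_DIFF function are assumptions.
-- Hypothetical sessionization check: no session should span more than 10 days.
-- Table and column names are illustrative assumptions.
SELECT
  *
FROM
  sessions
WHERE
  TIMESTAMP_DIFF(session_end, session_start, DAY) > 10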
High visibility output data
In most organizations, there are several datasets that are particularly important to the running of your business. If they contain incorrect data, they may cause the whole business to make wrong decisions. Pinpoint those high visibility, high impact datasets, and make sure you’ve defined as many assertions on them as you can think of.
What aspects of data quality should you be considering when writing assertions?
5 dimensions of data quality
We’ve discussed the areas of your data model where assertions can provide the most value: but what assertions should you be writing?
A lot has been written about data quality, but it’s typically useful to consider these 5 dimensions:
- Uniqueness: Are there any duplicate entries within your dataset?
- Validity: Does the dataset conform with expected business rules?
- Completeness: Does the dataset contain all of the data it should?
- Timeliness: Is the data available at the time it should be?
- Consistency: Is data across the warehouse consistent?
They all sound pretty sensible, but practically speaking, how can we write assertions to test these dimensions?
Uniqueness
Most datasets have a field (or combination of fields) by which they should be unique: often called the primary key, or grain, of the dataset. To check for uniqueness, the assertion should look for rows that share a duplicated primary key.
For a dataset where customer_id is the primary key:
SELECT
  customer_id,
  SUM(1) AS rows_per_customer_id
FROM
  customers
GROUP BY
  1
HAVING
  rows_per_customer_id > 1
Validity
For many of the fields in your dataset, you’ll know what format you expect the data to be in. The simplest example is that some fields should always be present (i.e. not NULL). You may also, for example, know that a numerical value should always be non-negative.
Assertions checking the validity of this data would typically follow this format:
SELECT
  *
FROM
  customers
WHERE
  customer_id IS NULL
  OR email IS NULL
  OR lifetime_orders < 0
Completeness
Completeness assertions look for gaps in your data. For example, it’s common to have an ETL system that loads daily data dumps into the warehouse. If data is missing for a given day, that could indicate a problem.
To look out for issues like this, we want to write an assertion that checks that some data is present for each of the last 30 complete days. The following assertion checks whether any load_dates are missing from source_table in the last 30 days:
WITH last_30_days AS (
  SELECT
    date
  FROM
    all_dates
  WHERE
    date >= DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
),

source_table_dates AS (
  SELECT
    load_date,
    COUNT(1) AS row_count
  FROM
    source_table
  GROUP BY
    1
)

SELECT
  last_30_days.date,
  source_table_dates.load_date,
  row_count
FROM
  last_30_days
LEFT JOIN
  source_table_dates
ON
  last_30_days.date = source_table_dates.load_date
WHERE
  source_table_dates.load_date IS NULL
Knowing exactly where to look for those gaps requires a good understanding of your data, and the corresponding assertions will vary accordingly. The example above hopefully provides inspiration.
Timeliness
Your datasets will all be built on top of source data being loaded into your warehouse by a data integration system. Timeliness alerts will let you know if data has stopped flowing into your data warehouse.
The following assertion checks how many hours have passed since the last load_timestamp. If it’s more than 6, the assertion fails.
WITH freshness AS (
  SELECT
    TIMESTAMP_DIFF(CURRENT_TIMESTAMP(), MAX(load_timestamp), HOUR) AS delay
  FROM
    production_database
)

SELECT
  *
FROM
  freshness
WHERE
  delay > 6
Consistency
We want to make sure that the data in one table is consistent with the data in another. For this, we can write assertions that compare two (or more) tables.
The following assertion checks that total revenue is the same in country_revenue_stats and product_revenue_stats:
WITH revenue_comparison AS (
  SELECT
    (SELECT SUM(revenue_usd) FROM country_revenue_stats) AS country_revenue,
    (SELECT SUM(revenue_usd) FROM product_revenue_stats) AS product_revenue
)

SELECT
  *
FROM
  revenue_comparison
WHERE
  country_revenue <> product_revenue
Using Dataform’s testing framework to automate assertions
Assertions are only effective if they are run on a regular basis, with alerts being sent to the team in the case of failures. Dataform, the data modelling platform built for modern data teams, comes with assertions out of the box.
Setting up some of the assertions mentioned above in Dataform, using the SQLX language, requires only a few lines of code:
config {
  type: "view",
  assertions: {
    uniqueKey: ["customer_id"],
    nonNull: ["customer_id", "email"],
    rowConditions: ["lifetime_orders > 0"]
  }
}

SELECT ...
More advanced assertions can be defined in a standalone SQLX file:
config {
  type: "assertion"
}

WITH revenue_comparison AS (
  SELECT
    (SELECT SUM(revenue_usd) FROM country_revenue_stats) AS country_revenue,
    (SELECT SUM(revenue_usd) FROM product_revenue_stats) AS product_revenue
)

SELECT
  *
FROM
  revenue_comparison
WHERE
  country_revenue <> product_revenue
These assertions are executed whenever your datasets are updated as part of an automated schedule. If there’s a failure, you’ll be notified by email or via a Slack message, with an error message pointing you to the specific test that failed.
Summary
Data teams have never had more responsibility within the organization than today. Business decisions – both big and small – depend on the information collected, transformed, analysed and shared by the data team. Thus, it’s incredibly important to ensure that the data is correct.
By adopting a framework with testing at its core, like Dataform, automated data tests can alert you to discrepancies both during development and in real time as your pipelines run. This not only prevents your organization from making bad decisions, but also frees up time for your team to focus on value-add work, rather than responding to complaints about inaccurate data.