Data Quality Testing: 7 Essential Tests
Getting started with data quality testing? Here are the 7 must-have checks to improve data quality and ensure reliability for your most critical assets.
Data quality testing is the process of validating that key characteristics of a dataset match what is anticipated prior to its consumption.
According to Gartner, bad data costs organizations on average an estimated $12.9 million per year. In fact, Monte Carlo’s own research found that data engineers spend as much as 40% of their workday firefighting bad data.
Those are some big numbers. Data quality issues are some of the most pernicious challenges facing modern data teams, and testing is one of the very first steps a data team will take on their journey to reliable data.
Whether by mistake or entropy, anomalies are bound to occur as your data moves through your production pipelines. In this post, we’ll look at 7 essential data quality tests you need right now to validate your data, plus some of the ways you can apply data quality testing today to start building out your data quality motion.
So, what is data quality testing anyway?
When it comes to data engineering, quality issues are a fact of life. Like all software and data applications, ETL/ELT systems are prone to failure from time to time.
Among other factors, data pipelines are reliable if:
- The data is current, accurate, and complete.
- The data is unique and free from duplicates.
- The model is sound and represents reality.
- And the transformed data is free from anomalies.
While there’s no silver bullet for data quality issues, data quality testing empowers engineers to anticipate specific, known problems and write logic to proactively detect quality issues before they can impact downstream users.
So, now that we have a common understanding of what data quality testing is, let’s look at some of the most common data quality tests you can run right now and how they might be used to detect quality issues in your data.
NULL values test
One of the most common data quality issues will arise from missing data, also known as NULL values. (Oh, those pesky NULL values).
NULL values occur when a field is left blank, either intentionally or through a pipeline error, such as those caused by an API outage. Let’s say you were querying the impact of a marketing program on sales lift by region, but the ‘region’ field was left blank on multiple records. Any rows where ‘region’ was missing would necessarily be excluded from your report, leading to inefficient future spend on regional marketing programs. Not cool.
As the name implies, a NULL values test will validate whether values within a specified column for a particular model are missing after the model runs. One excellent out-of-the-box test for uncovering NULL values is dbt’s generic not_null test.
tests/test_not_null.sql
{% test not_null(model, column_name) %}
-- dbt marks the test as failing if this query returns any rows
select *
from {{ model }}
where {{ column_name }} is null
{% endtest %}
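If you only need this check for a single column on a single model, a one-off (singular) dbt test works just as well. Here’s a minimal sketch, assuming a hypothetical orders model with a region column:
tests/assert_orders_region_not_null.sql
-- fails if any order row is missing a region (dbt fails a test when its query returns rows)
select *
from {{ ref('orders') }}
where region is null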
Volume tests
Is data coming in? Is it too little? Too much? These are all data quality issues related to the volume of data entering your database.
Volume tests are a must-have quality check that can be used to validate the number of rows contained in critical tables.
Missing data
Let’s say your data platform processes data from temperature sensors, and one of those sensors fails. What happens? You may get a bunch of crazy temperature values—or you may get nothing at all.
Missing data can quickly skew a data model or dashboard, so it’s important for your data quality testing program to quickly identify when data volume has dropped because data is missing. Volume tests enable you to spot when data volumes have changed, helping you uncover failure points and validate the accuracy of your data.
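As a sketch, a simple volume test can be written as a singular dbt test that fails when a day’s load arrives empty. The sensor_readings model, loaded_at column, and Snowflake-style date functions here are assumptions:
tests/assert_sensor_readings_loaded_today.sql
-- returns a single row (and therefore fails) if nothing landed in the last day
select count(*) as row_count
from {{ ref('sensor_readings') }}
where loaded_at >= dateadd('day', -1, current_timestamp())
having count(*) = 0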
Too much data
Too much data might not sound like a problem (it is called big data after all), but when row counts grow out of proportion, it can slow model performance and increase compute costs. Monitoring data volume increases can help reduce costs and maintain the integrity of models by leveraging only clean, high-quality data that will drive impact for downstream users.
Volume SLIs
SLAs (service level agreements) are critical to a modern data reliability motion, and volume SLIs (service level indicators) are the metrics that measure performance against a given SLA.
Volume tests can be used to measure SLIs by monitoring either table size or table growth relative to previous measurements. For example, if you were measuring absolute table size, you could trigger an event when either of the following occurs (a SQL sketch follows the list):
- The current total size (bytes or rows) decreases to a specific volume
- The current total size remains the same for a specific amount of time
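As a rough sketch, the first rule can be expressed as a SQL check against a hard floor. The orders table and the 10,000-row threshold below are hypothetical; in practice the threshold comes from your volume SLA:
-- flags a volume breach when the table drops below the agreed floor
select count(*) as current_rows
from orders
having count(*) < 10000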
Numeric distribution tests
Is my data within an accepted range? Are the values in a given column consistent with what we expect?
These are questions that can be answered using distribution tests. In statistical terms, a distribution test validates whether the data in a given table matches the distribution we expect to see. Essentially, does this data reflect reality?
These rules can easily be created in SQL by defining minimums and maximums for a given column.
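For instance, here’s a minimal range rule in plain SQL, assuming a hypothetical patients table with a weight_kg column and an accepted range of 1–400 kg:
-- returns any rows whose weight falls outside the accepted range (or is missing)
select *
from patients
where weight_kg < 1
   or weight_kg > 400
   or weight_kg is null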
One great example of an out-of-the-box distribution test is dbt’s accepted_values test, which allows the creator to define the set of acceptable values for a given column.
Great Expectations also provides a library of common “unit tests”, which can be adapted for your distribution data quality testing. For example, here’s how you might ensure the zip_code column represents a valid zip code using Great Expectations unit tests:
expect_column_values_to_be_between(
column="zip_code",
min_value=1,
max_value=99999
)
Inaccurate data
In the same way that missing data can paint a false picture of reality, inaccurate data can be equally detrimental to data models. Inaccurate data refers to the distribution issues that arise from incorrectly represented datasets. Inaccurate data could be as simple as a doctor mistyping a patient’s weight or an SDR adding an extra zero to a revenue number.
Creating distribution monitors to identify inaccurate data is particularly important for heavily regulated industries like healthcare and financial services.
Data variety
Sometimes new values enter a table that fall outside a typical distribution. These values aren’t necessarily anomalous, but they could be. And when it comes to data quality, “could be” is usually something we want to keep an eye on.
Including distribution tests as part of your data quality testing effort can help proactively monitor for new and unique values to spot potentially anomalous data that could indicate bigger issues down the road.
Uniqueness tests
Another common quality issue that beleaguers data engineers and downstream users alike is duplicate data. Duplicate data is any record that’s been fully or partially copied into another record in your database.
Without proper data quality testing, duplicate data can wreak all kinds of havoc—from spamming leads and degrading personalization programs to needlessly driving up database costs and causing reputational damage (for instance, duplicate social security numbers or other user IDs).
Duplicate data can occur for a variety of reasons, from loose data aggregation processes to human typing errors—but it occurs most often when transferring data between systems.
Uniqueness tests enable data teams to programmatically identify duplicate records so raw data can be cleaned and normalized before it enters the production warehouse.
If you’re using dbt, you can use the unique test to validate your data for duplicate records; similar uniqueness tests are readily available out of the box in a variety of other tools, depending on what you’ve integrated with your data stack.
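If you’d rather hand-roll the check, a duplicate lookup is a short SQL query. This sketch assumes a hypothetical users table keyed on user_id:
-- returns every user_id that appears more than once
select
    user_id,
    count(*) as occurrences
from users
group by user_id
having count(*) > 1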
Referential integrity tests
Referential integrity refers to the parent-child relationship between tables in a database, expressed through primary and foreign keys. The primary key is the root data that gets joined across tables to create models and derive insights.
But what happens if the data behind that primary key gets changed or deleted? That’s where referential integrity tests come in. Known as the relationships test in dbt, referential integrity tests ensure that any value in a child table has a corresponding record in its parent table.
Let’s say your marketing team is pulling a list of customer IDs to create a personalization campaign for the holidays. How does your marketing team know those customer IDs map back to real people with names and addresses? Referential integrity data quality testing ensures that no changes can be made to a parent or primary key without sharing those same changes across dependent tables.
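Outside of dbt, the same check can be written as a left join that surfaces orphaned child records. The orders and customers tables below are hypothetical:
-- returns child rows whose customer_id has no matching record in the parent table
select orders.customer_id
from orders
left join customers
    on orders.customer_id = customers.id
where customers.id is null
  and orders.customer_id is not null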
String patterns
In today’s distributed world, it’s not uncommon to find discrepancies in your data from all kinds of human error. Maybe a prospect forgot a character in their email address or an analyst accidentally changed a row without realizing it. Who knows!
Because data inconsistencies are fairly common, it’s important that those records are reconciled regularly via data quality testing to ensure that your data stays clean and accurate.
Utilizing regular expressions (RegEx) is an excellent way to validate that strings in a column match a particular pattern. String pattern tests can be used to validate a variety of common formats like UUIDs, phone numbers, emails, numbers, escape characters, dates, etc.
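As an illustration, here’s a string pattern check for malformed email addresses, assuming a hypothetical contacts table and Snowflake-style regexp_like (the exact function and escaping vary by warehouse):
-- returns rows whose email does not match a basic address pattern
select *
from contacts
where not regexp_like(email, '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$')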
Freshness checks
All data has a shelf life. When data is refreshed at a regular cadence, it paints an accurate picture of the data source. But when data becomes stale or outdated, it ceases to be reliable, and therefore useful, for downstream consumers.
And you can always count on your downstream consumers to notice when their dashboards aren’t being refreshed.
Freshness checks validate the quality of data within a table by monitoring how frequently that data is updated against predefined latency rules, such as when you expect an ingestion job to load on any given day.
Freshness tests can be created manually using SQL rules, or natively within certain tools, such as dbt’s source freshness command.
Freshness SLIs
In the same way you would write a SQL rule to verify volume SLIs, you can create a SQL rule to verify the freshness of your data. In this case, the SLI would be something like “hours since dataset refreshed.”
Like your SLI for volume, what threshold you choose for your SQL rule will depend on the normal cadence of your batched (or streamed) data and what agreements you’ve made in your freshness SLA.
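Here’s a minimal sketch of such a rule, assuming a hypothetical orders table with an updated_at timestamp, a six-hour freshness SLA, and Snowflake-style date functions:
-- flags a freshness breach when the newest record is older than the SLA allows
select
    max(updated_at) as last_update,
    datediff('hour', max(updated_at), current_timestamp()) as hours_since_refresh
from orders
having datediff('hour', max(updated_at), current_timestamp()) > 6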
How data observability powers data quality at scale
Now that you have a few essential data quality tests in your data quality testing quiver, it’s time to get at it! But remember, data reliability is a journey. And data quality testing is just the first step on your path.
As your company’s data needs grow, your manual data quality testing program will likely struggle to keep up. And even the most comprehensive data quality testing program won’t be able to account for every possible issue. Data quality testing can cover what you know might happen to your data, but you also need a way to monitor and alert on what you don’t know might happen (your unknown unknowns). Put another way, data quality issues fall into two buckets:
- Data quality issues that can be easily predicted. For these known unknowns, automated data quality testing and manual threshold setting should cover your bases.
- Data quality issues that cannot be easily predicted. These are your unknown unknowns. And as data pipelines become increasingly complex, this number will only grow.
Still, even with automated data quality testing, there’s extensive lift required to keep updating existing tests and thresholds, writing new ones, and deprecating old ones as your data ecosystem grows and your data evolves. Over time, this process becomes tedious and time-consuming, and it creates technical debt that you’ll need to pay down later.
Here’s where data observability comes in.
Monte Carlo’s automated, end-to-end data observability platform enables you to scale basic tests for freshness, volume, and schema, as well as set custom checks for data quality metrics, such as distribution anomalies. Image courtesy of Monte Carlo.
In the same way that application engineering teams don’t exclusively use unit and integration testing to catch buggy code, data engineering teams need to take a similar approach by making data observability a key component of their stack.
Data observability is an organization’s ability to fully understand the health of the data in their systems. Data observability eliminates data downtime by applying best practices learned from DevOps to data pipeline observability. Data observability tools use automated monitoring, automated root cause analysis, data lineage and data health insights to detect, resolve, and prevent data anomalies. Data observability both scales data quality through automated freshness, volume, and schema monitoring and alerting, and gives data teams the ability to set custom rules for specific, unknown unknown data quality issues like field health and distribution anomalies.
After all, delivering high quality data is about more than making sure your tables are being updated or your distributions are falling in an acceptable range. It’s about empowering your data engineers to deliver the best and most accurate data for your business needs at each stage of the data life cycle.
If data reliability is on your roadmap next year, consider going beyond data quality testing and embracing an end-to-end approach to data quality.
Until then, here’s wishing you no data downtime!
Interested in scaling data quality testing with data observability? Reach out to Brandon and the rest of the Monte Carlo team to learn more about data observability.