Data Quality Testing – A Quick Checklist to Measure and Improve Data Quality
Did you know?
More than 70% of revenue leaders in the InsideView Alignment Report 2020 rank data management as their highest priority, yet a Harvard Business Review study estimates that only 3 percent of companies’ data meets basic quality standards.
There is a major gap between what companies want in terms of data quality and what they are doing to fix it.
The first step in any data management plan is to test the quality of data and identify some of the core issues that lead to poor data quality. Here’s a quick checklist to help IT managers, business managers, and decision-makers analyze the quality of their data and understand which tools and frameworks can help make it accurate and reliable.
What is data quality and why does it matter?
Before we delve into the checklist, here’s a quick briefing on what data quality is and why it matters.
There is no single definition of data quality, and to give one would be to limit the scope of data itself. There are, however, benchmarks that can be used to assess the state of your data. For instance, high-quality data is accurate, complete, consistent, valid, timely, and unique, and what each of those dimensions means in practice depends on your business.
Prerequisites of data quality testing
01.
Purpose of your data
What do you want to achieve with your data?
02.
Data quality metrics
What does high-quality data mean to you?
You must understand the metrics that will help you measure data quality. These could be as simple as the ten critical data quality dimensions that we all know so well, but it is better to make them more specific to your use case. For example, the Date column in a dataset should contain formatted dates only, yet some correctly formatted dates may still be garbage values because they are too old to be plausible. So you could have your own, more specific definition of what accurate, complete, consistent, valid, timely, and unique mean to your company.
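To make this concrete, here is a minimal sketch of how a company-specific definition of a “valid” date could be encoded, including the rule that dates too old to be plausible are treated as garbage. The function name, the YYYY-MM-DD format, and the 120-year cutoff are all invented for illustration:

```python
from datetime import date, datetime
from typing import Optional

# A hypothetical, company-specific definition of a "valid" date of birth:
# it must be formatted as YYYY-MM-DD, must not lie in the future, and must
# not imply an age above 120 years (older values are treated as garbage).
def is_valid_birth_date(value: str, today: Optional[date] = None) -> bool:
    today = today or date.today()
    try:
        parsed = datetime.strptime(value, "%Y-%m-%d").date()
    except (TypeError, ValueError):
        return False                              # not a correctly formatted date
    if parsed > today:
        return False                              # future dates are invalid
    return (today.year - parsed.year) <= 120      # unrealistically old -> garbage

print(is_valid_birth_date("1985-07-14"))  # True
print(is_valid_birth_date("1844-01-01"))  # False: too old to be accurate
print(is_valid_birth_date("14/07/1985"))  # False: wrong format
```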
03.
Metadata of data fields
What is the correct definition and structure of each data attribute in your dataset?
This is probably the most important information that you need prior to your data quality testing process. Metadata is the information that describes your data. It helps you to understand the descriptive and structural definition of each data field in your dataset, and hence measure its impact and quality.
Examples of metadata include the data’s creation date and time, the purpose of data, source of data, process used to create the data, creator’s name and so on. Metadata allows you to define why a data field is being captured in your dataset, its purpose, acceptable value range, appropriate channel and time for creation, etc., and use that while testing and measuring data for quality.
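As an illustration, field-level metadata can be captured in a simple structure and referenced later during testing. The FieldMetadata class and its attributes below are hypothetical, not part of any specific tool:

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# A hypothetical structure for field-level metadata; the class name and
# attributes are illustrative only.
@dataclass
class FieldMetadata:
    name: str                                        # attribute name in the dataset
    description: str                                 # why the field is being captured
    source: str                                      # channel/system that creates the value
    data_type: type                                  # expected type of the value
    required: bool = True                            # may the value be missing?
    allowed_range: Optional[Tuple[int, int]] = None  # acceptable value range
    format_hint: str = ""                            # e.g. "YYYY-MM-DD" for dates

# Example: metadata for an Age field, usable later as test input.
age_meta = FieldMetadata(
    name="Age",
    description="Customer age, used for segmentation",
    source="web signup form",
    data_type=int,
    allowed_range=(0, 120),
)
print(age_meta)
```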
How do you check the quality of your data?
Level 1:
Quick fact-checking of data values
Since data is captured from our surroundings, we can quickly validate its accuracy by comparing it with known truth. For example, does the Age column contain any negative values; are required Name fields set to null; do Address field values represent real addresses; does the Date column contain correctly formatted dates; and so on.
This level of testing can be performed by generating a quick data profile of your dataset. It is a simple compare-and-label test: your dataset values are compared against your defined validations and some known, correct values, and classified as valid or invalid. Although it can be done manually, you can also use an automated tool that will run a quick profile test and show you where your data stands against the validation rules you defined.
But keep in mind that this level only tests the data itself, and not the metadata.
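A minimal sketch of such a level-1 profile, assuming a small pandas DataFrame with Name, Age, and Date columns (the sample rows are invented), might look like this:

```python
import pandas as pd

# A minimal level-1 profile: compare raw values against simple known-truth
# checks and count how many rows pass each one. Column names and sample
# rows are assumed for illustration.
df = pd.DataFrame({
    "Name": ["Alice", None, "Bob"],
    "Age":  [34, -2, 41],
    "Date": ["2023-01-15", "2023-13-40", "2022-11-03"],
})

checks = {
    "Age has no negative values":  df["Age"].ge(0),
    "Name is not null":            df["Name"].notna(),
    "Date is a correctly formatted date":
        pd.to_datetime(df["Date"], format="%Y-%m-%d", errors="coerce").notna(),
}

# Summarize how many rows pass each validation rule.
for rule, passed in checks.items():
    print(f"{rule}: {passed.sum()} of {len(df)} rows valid")
```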
Level 2:
Holistic analysis of the dataset
i.
Vertical testing
Vertical testing means computing the statistical distribution of each data attribute and validating that all values follow that distribution. This allows you to continuously verify that new, incoming data has the same nature as the data already residing in your dataset.
Furthermore, for this type of testing, you can determine the median and average values for each distribution, and set minimum and maximum thresholds. On every new entry to the dataset, you can check the probability that the new data belongs to this distribution. If the probability is high enough (approx. 95% or more), you can conclude that the data is valid and accurate.
You can also use the metadata of an attribute to compute a distribution and test incoming data against it. For example, the Name field usually contains 7-15 characters. If a new Name entry has only 2 characters, it can be flagged as a potential error, since its length does not conform to the expected distribution.
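Here is a simplified sketch of that vertical test, assuming a roughly normal distribution of name lengths and treating a band of about two standard deviations around the mean as the ~95% threshold mentioned above; the sample lengths are invented:

```python
import statistics

# Learn the distribution of an attribute (here, the length of Name values)
# from existing data, then flag new values that fall outside roughly the
# central 95% band (about two standard deviations from the mean).
existing_name_lengths = [7, 9, 11, 8, 12, 10, 13, 9, 15, 8]

mean = statistics.mean(existing_name_lengths)
stdev = statistics.stdev(existing_name_lengths)

def conforms_to_distribution(value: float, z_threshold: float = 2.0) -> bool:
    return abs(value - mean) <= z_threshold * stdev

print(conforms_to_distribution(len("Jo")))        # False: only 2 characters
print(conforms_to_distribution(len("Jonathan")))  # True: 8 characters
```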
ii.
Horizontal testing
Horizontal testing means performing a holistic analysis to assess the uniqueness of each record in your dataset. For this type of testing, you need to go row by row and verify that all records represent uniquely identifiable entities and that there are no duplicates present. This is a more complex form of testing, as it can be difficult to assess the uniqueness of a record in the absence of a unique key. For this purpose, advanced algorithms are used to perform fuzzy matching and determine probabilistic matches.
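The sketch below illustrates the idea, using Python’s built-in SequenceMatcher as a stand-in for the far more sophisticated matching algorithms real tools use; the records and the 0.8 threshold are invented for illustration:

```python
from difflib import SequenceMatcher
from itertools import combinations

# Horizontal testing without a unique key: compare every pair of records
# with a fuzzy string similarity and flag pairs above a threshold as
# probable duplicates.
records = [
    {"name": "Jonathan Smith", "city": "Boston"},
    {"name": "Jon Smith",      "city": "Boston"},
    {"name": "Maria Garcia",   "city": "Austin"},
]

def similarity(a: dict, b: dict) -> float:
    left  = f'{a["name"]} {a["city"]}'.lower()
    right = f'{b["name"]} {b["city"]}'.lower()
    return SequenceMatcher(None, left, right).ratio()

for (i, rec_a), (j, rec_b) in combinations(enumerate(records), 2):
    score = similarity(rec_a, rec_b)
    if score >= 0.8:  # illustrative probabilistic-match threshold
        print(f"Rows {i} and {j} look like duplicates (score={score:.2f})")
```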
Level 3:
Historical analysis of the dataset
Level 3 testing is the same as level 2, but instead of considering only the current dataset, historical records are also used to compute row matches and field distributions. This ensures that changes in data that happen over time are also considered while validating data values.
For example, yearly sales are expected to spike at the end of the year due to holidays and are comparatively slower in the seasons leading up to it. So, you can end up drawing incorrect conclusions about your data if you don’t take time into consideration. With this level, you can also run tests for detecting anomalies in your data. This is done by looking at the history of values in a data attribute and classifying current values as normal or abnormal.
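As a rough illustration, the sketch below judges a new December sales figure against the history of previous Decembers rather than against the whole year, so a normal holiday spike is not flagged as abnormal; all numbers are invented:

```python
import statistics

# Level-3 style check: compare a new value against the history of the same
# seasonal period instead of the full dataset.
monthly_sales_history = {
    # month -> sales observed in previous years (invented figures)
    "Dec": [120_000, 135_000, 128_000, 142_000],
    "Jun": [60_000, 58_000, 63_000, 61_000],
}

def is_anomalous(month: str, value: float, z_threshold: float = 2.0) -> bool:
    history = monthly_sales_history[month]
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(value - mean) > z_threshold * stdev

print(is_anomalous("Dec", 138_000))  # False: in line with past Decembers
print(is_anomalous("Dec", 45_000))   # True: a December this low is abnormal
```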
Using data quality testing tools and frameworks
01.
Manual QA/testing
02.
Open-source libraries
03.
Coded solutions built in-house
04.
Automated self-service tools
As data quality challenges become more complex, modern problems require modern solutions. Data scientists and data analysts spend 80% of their time testing data quality and only 20% extracting business insights. Automated data quality testing tools leverage advanced algorithms to free you from the manual labor of testing datasets for quality, and from maintaining coded solutions over time as data quality definitions evolve.
These tools are designed to be self-service and user-friendly so that anyone – business users, data analysts, IT managers – can generate quick data profiles as well as perform in-depth analysis of data quality through proprietary data matching techniques.
These tools typically offer two different types of testing engines – some come with only one, and very few specialize in both. Let’s take a look at them.
i.
Rules-based engines
Rules-based testing tools allow you to configure rules for validating datasets against your custom-defined data quality requirements. You can define rules for different dimensions of a data field. For example, its length, allowed formats and data types, acceptable range values, required patterns, and so on. These tools quickly profile your data against configured rules, and offer a concise data quality summary report which covers the results of the test.
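A toy example of the idea, with rule names, the pattern, and the sample values all invented for illustration, could be configured and run like this:

```python
import re

# A toy rules-based check: data quality requirements are expressed as named,
# configurable rules, and a column is profiled against them to produce a
# short summary report.
rules = {
    "length is exactly 5 or 10":  lambda v: len(v) in (5, 10),
    "contains digits only":       lambda v: v.isdigit(),
    "matches US ZIP pattern":     lambda v: re.fullmatch(r"\d{5}(-\d{4})?", v) is not None,
}

zip_codes = ["02139", "3012", "90210-1234", "ABCDE"]

for rule_name, rule in rules.items():
    passed = sum(1 for value in zip_codes if rule(value))
    print(f"{rule_name}: {passed}/{len(zip_codes)} values pass")
```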
ii.
Suggestion-based engines
Next course of action: Quality maintenance
01.
Employ data quality control for data integration
As new data enters your ecosystem, the overall quality of your data deteriorates. This is why you need to implement data quality checks at the data entry or data integration level. You want to make sure that any new data introduced into the system is accurate and unique, and is not a duplicate of any entity currently residing in your master record.
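One way to sketch such a gate, assuming hypothetical name and email fields and reusing a simple fuzzy similarity as the duplicate check, is shown below; the sample data and 0.85 threshold are assumptions:

```python
from difflib import SequenceMatcher

# A quality gate at the integration point: before a new record is added to
# the master dataset, check that required fields are present and that the
# record is not a near-duplicate of an existing entity.
master_records = [
    {"name": "Jonathan Smith", "email": "jon.smith@example.com"},
]

def passes_quality_gate(record: dict, threshold: float = 0.85) -> bool:
    # Required-field check
    if not record.get("name") or not record.get("email"):
        return False
    # Duplicate check against every record already in the master dataset
    for existing in master_records:
        score = SequenceMatcher(
            None, record["email"].lower(), existing["email"].lower()
        ).ratio()
        if score >= threshold:
            return False
    return True

print(passes_quality_gate({"name": "Jon Smith", "email": "jon.smith@example.com"}))    # False: duplicate
print(passes_quality_gate({"name": "Maria Garcia", "email": "m.garcia@example.com"}))  # True: new entity
```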
02.
Profile your data at regular intervals
03.
Fix root cause of identified errors
To conclude – test data quality before it gets too late
Most companies don’t engage in data quality testing until it becomes critical for a data migration or a merger, and by then it is often too late to undo the damage caused by poor data. Test your data quality, define the criteria, and set benchmarks to drive improvement.
Want to test your data quality? Give DME a try!
Luckily, you no longer have to put in the effort of manually testing your data, as most ML-based data quality testing solutions today let businesses do it in a few easy steps. You’re choosing between 2 minutes and 12 hours, and the choice doesn’t have to be daunting. Best-in-class solutions like DataMatch Enterprise offer free trials that you can benefit from. All you have to do is plug in your data source and let the software guide you through the process. You’ll be surprised at the hours of manual effort you’d save your team with an automated solution that also delivers more accurate results than manual methods.