A Comprehensive Framework for Data Quality Management
How to monitor and maintain Data Quality to make sure the data meets certain standards for specific business use-cases
Introduction
Big Data has become popular and is often considered the new oil of every industry these days. In fact, it is believed that there are some similarities between big data and oil in the way they are mined, stored, transported and consumed.
As a car owner, you probably understand the importance of the purity of petrol when you stop by the gas station every week. Petrol with low purity might seriously damage your car over time even though it can still keep your car moving. However, someone with a diesel engine on their truck might be less concerned than you are, as the expectation of purity is lower for diesel fuel than it is for petrol. Overall, different engines or applications require different levels of fuel quality.
Similarly, data of poor quality can have a negative impact on your organisation, although you can still derive some insights from it. However, different use-cases might have different expectations for data quality, since aiming for the highest quality of data is always costly. This is also captured in the definition of data quality: “Data quality shows the extent to which data meets users’ standards of excellence or expectations”[3].
When people think about data quality, they often refer to the term accuracy. For example, “Mark Zuckerberg” is the correct name of Facebook’s CEO, but in a database we might see something like “Mar Zuckerberg”, “Mark Zuckerber” or “Mark Zuckeberg”, which are inaccurate representations of the original name. However, accuracy alone is not enough to evaluate data quality, as there are also expectations on completeness, timeliness, consistency and uniqueness. Together, these are five dimensions for measuring the quality of data.
The remainder of this article is organised as follows. In Section 1, we introduce the definitions and examples of the five Data Quality dimensions: Accuracy, Completeness, Timeliness, Consistency and Uniqueness. Section 2 provides the formulas for metrics calculation, while Section 3 offers an overview of implementing the metrics with data validation rules. Section 4 describes the process of managing Data Quality, and Section 5 outlines the responsibilities of all parties involved in this process.
1. Data Quality Dimensions
1.1 Accuracy
Accuracy is defined as the closeness of a value to the correct representation of the real-life phenomenon it describes. Two types of accuracy can be identified: Semantic Accuracy and Syntactic Accuracy[1].
Accuracy is most often understood as semantic accuracy, which is the closeness of a value v to the true value v’. For example, when observing records in a database, a developer found {OS: Windows, Creator: Linus Torvalds} and {OS: Linux, Creator: Bill Gates}. They immediately knew there was a semantic issue with these records, as the creators should be Linus Torvalds for Linux and Bill Gates for Windows.
Another form of accuracy is Syntactic Accuracy, which is the syntactical closeness of the value v to the accepted values in a specific domain. Instead of comparing v to a true value v’, we are more interested in whether the value is syntactically correct. For example, a credit card number should have 16 digits in the form XXXX XXXX XXXX XXXX, so if something like 1234 1234 1234 1234 0 appears in the database, we have a syntactic accuracy issue.
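A syntactic accuracy check like this can often be expressed as a pattern match against the accepted format. Here is a minimal Python sketch; the regular expression and function name are purely illustrative:

```python
import re

# Accepted format: 16 digits in four groups of four, separated by single spaces
CARD_PATTERN = re.compile(r"^\d{4} \d{4} \d{4} \d{4}$")

def is_syntactically_valid(card_number: str) -> bool:
    """Return True if the value matches the accepted domain format."""
    return bool(CARD_PATTERN.match(card_number))

print(is_syntactically_valid("1234 5678 9012 3456"))    # True
print(is_syntactically_valid("1234 1234 1234 1234 0"))  # False: extra digit
```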
Semantic Accuracy is stricter in general, but the challenge in measuring it is that the true value v’ must be known beforehand. In the example above, everybody knows the correct name of Facebook’s CEO, so semantic accuracy can easily be evaluated. However, for something like a transaction value, say a customer purchasing a $300 watch, we must know the real value of the transaction, i.e. have a ground truth to compare against. This might not be the case, as sometimes we only have one source of data, and even if we have another source for comparison, we need a mechanism to link objects across the two data sources (object identification).
Syntactic Accuracy, on the other hand, is less complex to calculate and does not require a prior true value. As a result, it is more applicable in practice than semantic accuracy. Nevertheless, there are use-cases where semantic accuracy is applicable. In Australia, ABS data can be considered the ground truth for national information, and therefore it can be used to verify the semantic accuracy of other data sources in areas such as population, economy and income.
1.2 Completeness
Three types of completeness (Figure created by me)
Another aspect of data quality is completeness, defined as “the extent to which data are of sufficient breadth, depth, and scope for the task at hand”[1]. Completeness is often related to the number of missing values in a record (tuple completeness), a column (column completeness) or a table (relation completeness).
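To make the three levels concrete, here is a minimal pandas sketch (the sample table and column names are hypothetical) that computes tuple, column and relation completeness as the fraction of non-null cells:

```python
import pandas as pd

# Hypothetical customer table with some missing values
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "name": ["Alice", None, "Carol", "Dan"],
    "email": [None, None, "carol@example.com", "dan@example.com"],
})

# Tuple completeness: fraction of non-null cells in each record (row)
tuple_completeness = df.notna().mean(axis=1)

# Column completeness: fraction of non-null cells in each column
column_completeness = df.notna().mean(axis=0)

# Relation completeness: fraction of non-null cells over the whole table
relation_completeness = df.notna().to_numpy().mean()

print(tuple_completeness, column_completeness, relation_completeness, sep="\n")
```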
1.3 Timeliness
Timeliness indicates the freshness of the data for a specific application. Two elements contribute to this indicator: Currency and Volatility. Currency reflects how promptly the data is refreshed. For example, if a customer has recently changed their address, how quickly is the record updated? Currency is considered high if the record is updated right after the customer’s change request, and low otherwise. Timeliness also depends on Volatility, which shows how much the data varies over time. Data updated daily has higher volatility than data refreshed monthly. For example, Date of Birth has essentially zero volatility whilst a stock price has extremely high volatility. Volatility can also be expressed as the length of time for which the data remains valid.
Currency and Volatility affect each other in terms of their contributions to timeliness. Highly volatile data should be current, whereas data with low volatility does not require high currency. A five-day delay is a problem for daily updated data, whilst it might not be for monthly or annually updated data.
1.4 Consistency
Consistency indicates the level of compliance with integrity constraints between two or more columns from the same or different tables. For example, if the country code is AU, the state column should have one of the following values: VIC, NSW, WA, QLD, TAS, NT, ACT, SA.
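A consistency rule like this can be checked directly against the data. The sketch below, using a hypothetical table, counts records that violate the AU country/state constraint and derives a simple consistency score:

```python
import pandas as pd

AU_STATES = {"VIC", "NSW", "WA", "QLD", "TAS", "NT", "ACT", "SA"}

df = pd.DataFrame({
    "country_code": ["AU", "AU", "AU"],
    "state": ["VIC", "NSW", "TX"],   # "TX" violates the constraint
})

# A record is consistent if, whenever country_code is AU, state is an Australian state
is_au = df["country_code"] == "AU"
violations = df[is_au & ~df["state"].isin(AU_STATES)]

consistency = 1 - len(violations) / len(df)
print(violations)
print(f"Consistency: {consistency:.2f}")
```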
1.5 Uniqueness
This dimension captures the number of duplicated records. However, there are different levels of duplication, and each requires a different method to measure.
First form of duplication (Figure created by me)
The most basic form of duplication is when two records store exactly the same value in every cell. This is unlikely to happen if a key column is used as a primary key. Nevertheless, a database system may use a generated key instead of an existing meaningful column in the data. In that case, detecting duplication is slightly more complex, as we cannot rely on the key alone. The naive approach is to compare the values cell by cell, which is fairly easy to implement.
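For exact duplicates, the cell-by-cell comparison is already provided by standard tooling. A minimal pandas sketch with made-up records:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Mark Zuckerberg", "Mark Zuckerberg", "Bill Gates"],
    "company": ["Facebook", "Facebook", "Microsoft"],
})

# Exact duplicates: every cell matches another record cell by cell
exact_duplicates = df[df.duplicated(keep="first")]
uniqueness = 1 - len(exact_duplicates) / len(df)

print(exact_duplicates)
print(f"Uniqueness: {uniqueness:.2f}")
```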
Second form of duplication (Figure created by me)
The next form of duplication is when two records do not store exactly the same values but refer to the same real-life object or entity. The discrepancy might simply be caused by typos during data entry, or by different data formats across source systems. With this kind of duplication, the naive approach above is not applicable. A more generic and advanced approach is to vectorise the records with TF-IDF (Term Frequency-Inverse Document Frequency) and measure the similarity between two records with cosine similarity. This method requires more computation, but it is smarter and works for a wider range of scenarios.
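One possible way to flag such near-duplicates is sketched below, assuming TF-IDF over character n-grams combined with cosine similarity; the records and the 0.8 threshold are purely illustrative and would need tuning for real data:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame({
    "name": ["Mark Zuckerberg", "Mark Zuckerber", "Bill Gates"],
    "company": ["Facebook", "Facebook Inc.", "Microsoft"],
})

# Concatenate the fields of each record into a single string
records = df.astype(str).agg(" ".join, axis=1)

# Character n-grams are robust to typos and small format differences
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
matrix = vectorizer.fit_transform(records)

similarity = cosine_similarity(matrix)

# Flag pairs of distinct records above a (tunable) similarity threshold
threshold = 0.8
for i in range(len(df)):
    for j in range(i + 1, len(df)):
        if similarity[i, j] >= threshold:
            print(f"Possible duplicates: row {i} and row {j} "
                  f"(similarity {similarity[i, j]:.2f})")
```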
2. Metrics Calculation
Given a relation R with K attributes and N records, the metrics are calculated as follows:
Formulas for metrics calculation (Figure created by me)
- Metrics range from 0 to 1 (0 is poor, 1 is excellent).
- B is a Boolean function returning 1 if the condition is met and 0 otherwise.
- v_ij is the value at row i and column j.
- For the Timeliness metric, T is the Age of the data and F is the update Frequency of the data (one possible implementation is sketched below).
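The figure above contains the exact formulas; the code below is only my interpretation of them as a minimal Python sketch. In particular, the timeliness function assumes a max(0, 1 - T / F) formulation with F read as the expected refresh period, which you should adjust to your own definitions:

```python
import pandas as pd

def completeness(df: pd.DataFrame) -> float:
    # B(v_ij) = 1 when the cell is not null, averaged over all K * N cells
    return float(df.notna().to_numpy().mean())

def syntactic_accuracy(df: pd.DataFrame, rules: dict) -> float:
    # rules maps a column name to a Boolean check applied to each value in it
    checks = [df[col].apply(check) for col, check in rules.items()]
    return float(pd.concat(checks).mean())

def timeliness(age: float, refresh_period: float) -> float:
    # Assumed formulation: max(0, 1 - T / F), with F read as the refresh period
    return max(0.0, 1.0 - age / refresh_period)

# Toy usage with a hypothetical table
df = pd.DataFrame({"age": [34, -2, None], "state": ["VIC", "NSW", "TX"]})
print(completeness(df))
print(syntactic_accuracy(df, {"state": lambda v: v in {"VIC", "NSW", "QLD"}}))
print(timeliness(age=2, refresh_period=7))
```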
3. Metrics Implementation
The data quality framework will be built on top of an existing Data Validation Framework where all the data validity rules are implemented. A Data Validation Framework is a set of methodologies and software libraries containing tools to implement validity rules. For example, the Data Validation Framework can provide methods or functions to check whether the numbers within a column fall in a specific range; a use-case for this could be making sure that the Age column does not contain negative values. To the best of my knowledge, there is no widely-used or industry-standard data validation framework generic enough to be used across various business use-cases. However, the Methodology for Data Validation from the European Commission [5] is one of the most concise documents about Data Validation at multiple levels. I refer interested readers to [5] for more details about Data Validation; we will discuss this topic in another article.
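As an illustration of such a range rule, here is a minimal sketch; the function and column names are hypothetical and not tied to any specific validation framework:

```python
import pandas as pd

def check_range(series: pd.Series, minimum=None, maximum=None) -> pd.Series:
    """Return a Boolean Series: True where the value is inside the allowed range."""
    valid = pd.Series(True, index=series.index)
    if minimum is not None:
        valid &= series >= minimum
    if maximum is not None:
        valid &= series <= maximum
    return valid

ages = pd.Series([34, -2, 51, 120])
valid = check_range(ages, minimum=0, maximum=130)
print(ages[~valid])   # rows violating the rule, e.g. the negative age
```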
Data Validation, Data Quality and Business (Figure created by me)
At a high level, the Data Quality Framework acts as a governance layer connecting data quality aspects to the criticality of business use-cases. There is an iterative process to continuously maintain and improve the quality of data to meet users’ expectations, which we will discuss in the next section.
Data Quality Dashboard (Figure created by me)
At the implementation level, the metrics will be calculated by the data validation framework and stored in central storage for auditing and reporting. For a particular dataset, a dashboard will be provided to monitor data quality, and alarms will be sent if validity rules are violated.
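A minimal sketch of this flow, using SQLite as a stand-in for the central metrics store and a print statement as a placeholder alarm channel (the table and function names are hypothetical):

```python
import datetime
import sqlite3

# Central storage for metric results (SQLite stands in for a real metrics store)
conn = sqlite3.connect("data_quality.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS dq_metrics (
        dataset TEXT, metric TEXT, value REAL, threshold REAL, measured_at TEXT
    )
""")

def send_alarm(message: str) -> None:
    # Placeholder: route to email, Slack, PagerDuty, etc. in a real setup
    print(f"ALERT: {message}")

def record_metric(dataset: str, metric: str, value: float, threshold: float) -> None:
    # Persist the measurement for auditing and reporting
    conn.execute(
        "INSERT INTO dq_metrics VALUES (?, ?, ?, ?, ?)",
        (dataset, metric, value, threshold,
         datetime.datetime.now(datetime.timezone.utc).isoformat()),
    )
    conn.commit()
    # Raise an alarm when the metric falls below the agreed threshold
    if value < threshold:
        send_alarm(f"{dataset}: {metric} = {value:.2f} is below threshold {threshold}")

record_metric("customers", "completeness", 0.82, threshold=0.95)
```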
4. Data Quality Management Process
The Data Quality Management Process has to be an iterative cycle, as data quality needs to be continuously maintained and improved to meet users’ expectations. The process consists of two phases comprising five stages, as follows:
Data Quality Management Process (Figure created by me)
The first stage mainly focuses on the discovery part of the dataset to understand the impact of low-quality data on the business. It consists of top-down and bottom-up approaches. The top-down approach collects user requirements, use-case information and pain points when using the data, whilst the bottom-up approach looks more closely at the content of the data and employs statistical analysis to identify potential data quality issues.
In the second stage, information from the top-down and bottom-up approaches is combined to define the threshold for each of the five metrics. Business-related rules should also be identified for the data.
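As an example of what the output of this stage might look like, below is a hypothetical per-dataset threshold configuration together with a small helper that compares measured scores against it:

```python
# Hypothetical per-dataset thresholds agreed between the business and data teams
DQ_THRESHOLDS = {
    "customers": {
        "accuracy": 0.98,
        "completeness": 0.95,
        "timeliness": 0.90,
        "consistency": 0.99,
        "uniqueness": 1.00,
    },
}

def meets_expectations(dataset: str, scores: dict) -> dict:
    """Compare measured scores against the agreed thresholds for a dataset."""
    thresholds = DQ_THRESHOLDS[dataset]
    return {metric: scores.get(metric, 0.0) >= t for metric, t in thresholds.items()}

print(meets_expectations("customers", {"accuracy": 0.97, "completeness": 0.99}))
```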
After that, these rules are mapped into validity rules in stage three, to be implemented later using the Data Validation Framework in stage four. Also in stage three, solutions for any data quality issues detected in the first stage are designed. Finally, all the designs need to be aligned with the Interface Agreements so that any violation of the data standards declared in the Interface Agreements can be detected.
Stage four is the actual implementation of the designs using the Data Validation Framework, whilst stage five relates to the operational side once the data feed is live. Data quality will be continuously monitored, and an incident-reporting process has to be in place to trigger the next cycle of the data quality management process.
As mentioned above, this is an iterative process and in general, there are two possible methods to trigger a new cycle. The first method (preventive approach) is to periodically review the performance and repeat the cycle after a fixed period of time. The second method (reactive approach) is to start a new cycle when a data quality issue or an incident has been raised. Both methods can also be applied at the same time depending on the use-case and application.
5. Responsibilities and Tools
This section outlines the collaboration between the Data Engineering Squad and Data Consumers (Streams) on Data Quality matters. Throughout this article, it can be seen that the expectation of data quality for a specific use-case is the starting point for the process of monitoring and improving the quality of data. Therefore, Data Quality Management is a Business-Centric (or User-Centric) process where requirements and expectations from the business are the core of the operating model. The model can be illustrated as follows:
Operation model (Figure created by me)
Within this model, Streams are at the forefront, interacting with clients to gather requirements and design the use-cases around the data. Streams are in charge of stages one and two of the data quality management process since they understand how the data is used and know what the data should look like.
In stages three, four and five, the responsibilities are, to some extent, shared between the Data Engineering Squad (DE Squad) and Streams. The DE Squad provides the tools needed to implement validity rules via the Data Validation Framework, and maintains and improves this framework to meet requirements from Streams. The DE Squad is also responsible for the dashboard and alarming solution that Streams can easily set up for their specific data sources. Streams, in turn, use the tools provided by the DE Squad to apply the metrics and validation rules, and configure the alarms and dashboards with support from the DE Squad.
Data Quality reviews can be performed by Streams, and they can request the DE Squad to add new features to the framework to meet business use-case requirements. Framework-related incidents raised by Streams will be handled by the DE Squad within an acceptable SLA.
6. Summary
In this article, we have discussed the definition of Data Quality and its five dimensions: Accuracy, Completeness, Timeliness, Consistency and Uniqueness. We have walked through the calculation of the metrics, the high-level implementation, the process of managing Data Quality and the responsibilities of all parties involved in this process. This framework provides not only a foundation that companies can build on and adapt to meet their business requirements, but also an iterative process to maintain and improve Data Quality with both preventive and reactive approaches.
I encourage readers to share the data-quality-related challenges in your organisation, how you managed to solve them, and any potential use-cases where this framework could help tackle your data quality management issues in a different way.
References
This framework is adapted from the following documents with slight modifications to meet current practical requirements.
[1] C. Batini and M. Scannapieco, Data Quality Concepts, Methodologies and Techniques, Springer, 2006.
[2] D. Loshin, The Practitioner’s Guide to Data Quality Improvement, Morgan Kaufmann, 2010.
[3] “Data Quality Management: Roles, Processes, Tools,” 2019. [Online]. altexsoft.com
[4] I. Mikhailouskaya, “Your Guide to Data Quality Management,” 2018. [Online]. scnsoft.com
[5] ESSnet ValiDat Foundation, Methodology for Data Validation 1.0, European Commission, 2016.