Data quality assessment – what makes good data?
The data quality management process consists of many steps. In fact, the data you acquire for your company have to be analyzed and prepared properly before being useful for business intelligence or other purposes. From this article, you will learn what data quality is, what makes data good, and how to take care of data quality management.
Don’t know much about data quality management yet, but determined to improve the quality of your business insights? Great! You are in the right place! From this article, you will learn more about the dimensions of data quality; in short, we’ll describe what datasets should be like to be considered high quality. We’ll also give you some hints on which data quality management tools you can use in your business.
Table of contents
What is Data Quality?
Data quality should be understood as the degree of both the correctness and the usefulness of data. Data quality assessment is an important part of the data quality management process. Measures of data quality are based on data quality characteristics and, ultimately, on the business value of the insights derived from the data. If there is a mismatch in the data stream, we should know how to identify the particular corrupt data. After that, we have to identify the data errors that need resolution and assess whether the data in our IT systems are fit to serve their intended purpose. Data quality problems can doom a project, leading to added expenses, lost sales opportunities, or fines for improper financial or regulatory compliance reporting in fields such as banking, research, automotive or medicine. This is why constant data quality control is so important and why it is worth acquainting yourself with the data validation checks, techniques and tools in the following overview.
Basics: The Dimensions of Data Quality
There are six main data quality dimensions for measuring the quality of your business information: accuracy, completeness, consistency, validity, uniqueness, and timeliness. Let’s look at them one by one:
Data accuracy
Data accuracy refers to the degree to which data correctly represent the “real-life” objects they are intended to model. In many cases, accuracy is measured by how well the values agree with an identified source of correct information (such as reference data). This dimension is actually quite challenging to monitor, not just because it requires a secondary source for corroboration, but because real-world information may change over time. A classic example of an accuracy issue is the difference between US and EU date formats (MM/DD/YYYY vs DD/MM/YYYY). Believe it or not, this is still a common problem and can render such data useless. Data accuracy is arguably the most important characteristic that makes data usable and purposeful.
Data completeness
Data completeness refers to the comprehensiveness or wholeness of the data. For data to be truly complete, there should be no gaps or missing information. Sometimes incomplete data is unusable, but it is often used anyway, despite the missing information, which can lead to costly mistakes and false conclusions. Incomplete data is often the result of unsuccessful data collection. For example, gathering contact details requires a name, a surname, an e-mail address, and the correct relation of these values across records. Incomplete data can lead to inconsistencies and errors that impact accuracy and reliability.
Data consistency
A strict definition of consistency specifies that two data values drawn from separate data sets must not conflict with each other. In other words, values that describe the same thing in two data sets must agree. This may concern record-level consistency, cross-record consistency or temporal consistency. Note that consistency does not necessarily imply correctness. A common example of a consistency failure is a broken backup that no longer matches its source.
Data validity
Validity is the most intuitive of all the data quality dimensions: data should be collected according to defined business rules and parameters, while conforming to the right format and falling within the right range. Physical and biological quantities such as body temperature, height, or life expectancy have clearly defined limits and scales; every value outside the defined range is invalid.
Data uniqueness
The dimension of uniqueness demands that no entity exists more than once within the data set. Uniqueness ensures there are no duplications or overlapping of values across all data sets. Data cleansing and deduplication can help remedy a low uniqueness score. An example in which data uniqueness is vital is a phone number or personal ID number database.
Data timeliness
Timeliness – timely data is available when it is required. Data may be updated in real time to ensure that it is readily available and accessible. Timeliness can be measured as the time between when information is expected and when it is readily available for use. The success of business applications relying on master data depends on consistent and timely information. Therefore, service levels specifying how quickly the data must be propagated through the centralized repository should be defined so that compliance with those timeliness constraints can be measured. An example of when timeliness is of utmost importance is tracking the time of patient care events in the emergency room.
The summary of the above description is provided in the Data Quality Dimensions checklist:
Timeliness
Definition: The degree to which data represent reality at the required point in time.
Reference: The time at which the recorded real-world event occurred.
Measure: Time difference (unit of measure: time).
Scope: Any data item, record, data set or database.
Related dimensions: Accuracy, because accuracy inevitably decays with time.
Example: Tracking the time of patient care events in the emergency room.

Completeness
Definition: The proportion of stored data against the potential of “100% complete”.
Reference: Business rules which define what “100% complete” represents.
Measure: The absence of blank (null) values or the presence of non-blank values (unit of measure: percentage).
Scope: 0–100% of critical data to be measured in any data item, record, data set or database.
Related dimensions: Validity and Accuracy.
Example: Gathering contact details requires a name, a surname and an e-mail address, plus the correct relation of this data between records.

Uniqueness
Definition: Nothing is recorded more than once, based upon how that thing is identified.
Reference: The data item measured against itself or its counterpart in another data set or database.
Measure: Analysis of the number of things in the “real world” compared to the number of records of those things in the data set (unit of measure: percentage).
Scope: Measured against all records within a single data set.
Related dimensions: Consistency.
Example: The percentage of duplicate records in a data set.

Validity
Definition: Data is valid if it conforms to the syntax (format, type, range) of its definition.
Reference: Database, metadata or documentation rules on allowable types (string, integer, floating point), format (length, number of digits) and range (minimum, maximum, or containment within a set of allowable values).
Measure: Comparison between the data and the metadata or documentation for the data item (unit of measure: percentage of data items deemed valid or invalid).
Scope: All data can typically be measured for validity; validity applies at the data item level and the record level (for combinations of valid values).
Related dimensions: Accuracy, Completeness, Consistency and Uniqueness.
Example: Every value that falls within the defined range of the data, such as body temperature or height.

Consistency
Definition: The absence of difference when comparing two or more representations of a thing against a definition.
Reference: The data item measured against itself or its counterpart in another data set or database.
Measure: Analysis of pattern and/or value frequency (unit of measure: percentage).
Scope: Assessment of things across multiple data sets and/or assessment of values or formats across records, data sets and databases.
Related dimensions: Accuracy, Validity and Uniqueness.
Example: Sadly, a broken backup.

Accuracy
Definition: The degree to which data correctly describes the “real world” object or event being described.
Reference: Ideally, the “real world” truth established through primary research.
Measure: The degree to which the data mirrors the characteristics of the real-world object it represents (unit of measure: the percentage of data entries that pass the data accuracy rules).
Scope: Any “real world” object that may be characterized or described by data held as a data item, record, data set or database.
Related dimensions: Validity, Uniqueness and Consistency.
Example: US vs EU date formats (MM/DD/YYYY vs DD/MM/YYYY).
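To make these measures more concrete, below is a minimal pandas sketch that computes completeness, uniqueness and validity as percentages for a small, hypothetical contacts table. It only illustrates the checklist above and is not a production-grade profiler.

```python
import pandas as pd

# Hypothetical contact records used only to illustrate the measures above.
contacts = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "email": ["anna@example.com", None, "jan@example.com", "not-an-email"],
})

# Completeness: share of non-blank (non-null) e-mail values.
completeness = contacts["email"].notna().mean() * 100

# Uniqueness: share of records that are not duplicates of an earlier record (by id).
uniqueness = (1 - contacts["id"].duplicated().mean()) * 100

# Validity: share of e-mails matching a simple syntax (format) rule.
valid_email = contacts["email"].str.match(r"^[^@\s]+@[^@\s]+\.[^@\s]+$", na=False)
validity = valid_email.mean() * 100

print(f"completeness={completeness:.0f}%  uniqueness={uniqueness:.0f}%  validity={validity:.0f}%")
```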
How to determine data quality
Data quality assessment is not an easy task. It requires a solid understanding of the data quality dimensions and the metrics behind them, so you need experienced and talented data quality experts. You have two options: invest in external data quality services or take care of this in-house with your own team.
If you possess basic knowledge about the dimensions of data quality, you can go deeper into determining data quality. Your first goal is to determine the condition of the dataset by performing data asset inventories in which the relative accuracy, uniqueness, and validity of your data are measured in baseline studies. The established baseline ratings for data sets can then be compared against the data in your systems on an ongoing basis to help identify new data quality issues so they can be resolved.
The second step of data quality management is creating a set of data quality rules based on business requirements, which will be used to assess whether your data are good enough or in need of fixing. Such rules specify the required quality levels for data sets and detail which data elements need to be included so they can be checked against the data quality attributes.
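As a simple illustration, the sketch below encodes a few hypothetical business rules as minimum thresholds and compares previously measured values against them; both the metrics and the limits are invented for the example.

```python
# Hypothetical measured values for a data set (e.g. produced by a profiling job).
measured = {"email_completeness": 97.5, "id_uniqueness": 100.0, "email_validity": 92.0}

# Business rules: minimum acceptable level (in percent) for each metric.
rules = {"email_completeness": 99.0, "id_uniqueness": 100.0, "email_validity": 95.0}

for metric, threshold in rules.items():
    status = "OK" if measured[metric] >= threshold else "NEEDS FIXING"
    print(f"{metric}: measured {measured[metric]}% vs required {threshold}% -> {status}")
```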
But what do we do when we discover poor data quality?
What is data cleansing?
Poor-quality datasets can be improved through data cleansing, also called data scrubbing. This is a very important part of the data quality management process; its main goal is to fix data errors while enhancing data sets by adding missing values, more up-to-date information or additional records. Depending on the size of the data sets, records can be screened by checking a value in every record or by checking metadata such as the number and order of headers, columns and rows, among other things. All of these operations can be performed using dedicated tools and techniques, examples of which are included in the next paragraph.
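As a small, hedged example of such cleansing, the sketch below deduplicates a hypothetical contact table, drops records missing a key field and normalizes formatting with pandas; real cleansing pipelines are of course more involved.

```python
import pandas as pd

# Hypothetical raw contact data with duplicates, gaps and inconsistent formatting.
raw = pd.DataFrame({
    "email": ["Anna@Example.com ", "anna@example.com", None, "jan@example.com"],
    "name": ["Anna", "Anna", "Jan", "Jan"],
})

cleaned = (
    raw
    .assign(email=lambda d: d["email"].str.strip().str.lower())  # consistency: normalize formatting
    .dropna(subset=["email"])                                    # completeness: drop records missing a key field
    .drop_duplicates()                                           # uniqueness: remove exact duplicate rows
)
print(cleaned)
```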
Data quality management tools and techniques
Specialized software tools for data quality management can match records, delete duplicates, validate new data, establish remediation policies and identify personal data in data sets; they also perform data profiling to collect information about data sets and identify possible outlier values. Such tools enable companies to monitor data quality efficiently, so it is worth learning what kinds of solutions are available.
The Great Expectations library
One data quality solution, for example, is the Great Expectations library (https://greatexpectations.io/expectations/), which describes itself with the slogan: Always know what to expect from your data. Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling. The library delivers confidence, integrity, and acceleration to data science and data engineering teams by covering common data issues with built-in expectations such as:
expect_column_values_to_not_be_null
expect_column_values_to_match_regex
expect_column_values_to_be_unique
expect_table_row_count_to_be_between
expect_column_median_to_be_between
For more about these techniques, the official Great Expectations website offers several case studies describing the experiences of different companies and teams.
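As a rough illustration of how such expectations are used, here is a minimal sketch that validates a small pandas DataFrame with some of the expectations listed above. The data and column names are hypothetical, and the entry point shown (ge.from_pandas, from the classic pandas-backed API) may differ in newer releases of the library, so treat this as a sketch rather than a definitive recipe.

```python
import great_expectations as ge
import pandas as pd

# Hypothetical customer records to validate.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 3],
    "email": ["anna@example.com", None, "jan@example.com", "jan@example.com"],
})

# Wrap the DataFrame so expectation methods become available
# (classic pandas-backed API; newer versions expose a different entry point).
gdf = ge.from_pandas(df)

results = {
    "email is not null": gdf.expect_column_values_to_not_be_null("email"),
    "email matches regex": gdf.expect_column_values_to_match_regex(
        "email", r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
    ),
    "customer_id is unique": gdf.expect_column_values_to_be_unique("customer_id"),
    "row count in range": gdf.expect_table_row_count_to_be_between(
        min_value=1, max_value=1_000_000
    ),
}

for name, result in results.items():
    print(f"{name}: {'PASS' if result.success else 'FAIL'}")
```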
An alternative approach to data quality management
Data Validation algorithms offer another way to determine Data Quality:
On MultiTech’s Medium account there is a proposal for handling big data migration workloads in Apache Spark which includes big data validation. Big data refers to a volume of data that cannot be stored and processed using a traditional computing approach within a given time frame. In numerical terms, this means processing gigabytes, terabytes, petabytes, exabytes or even larger amounts of data. At this scale, validation techniques must be adapted to the problem. That’s why the presented algorithm includes the following checks (a minimal PySpark sketch of some of them follows the list):
- Row and column count
- Checking column names
- Checking subset data without hashing
- Statistics comparison
- Hash validation on the entire data set
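As promised above, here is a minimal PySpark sketch of three of these checks (row and column count, column names, and hash validation). The input paths and table layout are hypothetical, and the actual implementation described on Medium may differ; this only shows the general shape of such a validation job.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("migration-validation").getOrCreate()

# Hypothetical source and target data sets produced by a migration job.
source = spark.read.parquet("/data/source/customers")
target = spark.read.parquet("/data/target/customers")

# 1. Row and column count.
assert source.count() == target.count(), "Row counts differ"
assert len(source.columns) == len(target.columns), "Column counts differ"

# 2. Column names (and their order).
assert source.columns == target.columns, "Column names or order differ"

# 3. Hash validation on the entire data set: fingerprint every row,
#    then find rows present on one side but not the other.
#    (concat_ws skips nulls, which is acceptable for a rough fingerprint.)
def with_row_hash(df):
    cols = [F.col(c).cast("string") for c in df.columns]
    return df.withColumn("row_hash", F.sha2(F.concat_ws("||", *cols), 256))

missing_in_target = with_row_hash(source).select("row_hash").exceptAll(
    with_row_hash(target).select("row_hash")
)
print("Rows missing in target:", missing_in_target.count())
```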
These checks matter because tracking and reporting on data quality leads to a better understanding of data accuracy. Furthermore, the processes and tools used to generate this information should be automated wherever possible.
Summary
This blog post is an introduction to the world of data quality and data quality assessment. It includes a description of the six data quality dimensions: accuracy, completeness, consistency, validity, uniqueness, and timeliness, with their definitions, examples, and descriptions of the relations between them. Based on the presented indicators, methods of determining data quality have been discussed, as well as three basic ways of handling poor data quality: processing the data, using software tools, and applying dedicated algorithms.
Are you considering leveraging professional data quality services? We can help you with data quality assessment and ensure the highest quality of your business data. Contact us if your company requires efficient data quality management solutions. For more articles, please follow our blog.
Check out our blog for more details on Data Pipeline solutions.