Data Quality Management 101

Data Quality Management is necessary for dealing with the very real challenge of low-quality data. It eliminates the time and energy wasted on manually reprocessing inaccurate data. Low-quality data can also hide problems in operations and make regulatory compliance a challenge.

Good Data Quality Management is essential for making sense of data. It establishes a framework for the organization and supports rules for Data Quality.

Accurate, up-to-date data offers a clear picture of the organization’s day-to-day operations. Poor-quality data promotes mistakes and errors, leading to unnecessary expenses and lost invoices. Accurate data promotes confidence in application results and reduces unnecessary costs.

Good Data Quality Management will build a foundation of useful information that helps in understanding the organization’s expenses and processes.

Poor-quality data may have been recorded incorrectly at the outset, distorted during use or storage, or allowed to become outdated. Other examples of poor Data Quality include:

  • Incomplete data
  • Inconsistent data
  • Duplicated data
  • Poorly defined data
  • Poorly organized data
  • Poor Data Security

What Is Data Quality Management?

Data Quality Management can be described as a group of practices used to maintain and access accurate information. Each step of handling the data must include efforts to support accuracy, from acquiring the data through processing, distributing, and analyzing it, with the goal of producing high-quality, error-free information.

Increasingly, businesses are using data to promote intelligent decision-making on marketing issues, product development, and communications strategies. High-quality data can normally be processed and analyzed more quickly than low-quality data, leading to faster and better insights and supporting business intelligence gathering and analytics.

What Are Data Quality Tools?

A good Data Quality Management system makes use of tools that can help to improve an organization’s data trustworthiness. Data Quality tools are the processes and technologies for identifying, understanding, and correcting flaws in data; they support effective information governance across operational business processes and decision-making. The tools available cover a range of functions, such as the following (a brief code sketch of several of these functions appears below the list):

  • Data Cleansing: Used to correct unknown data types (reformatting), eliminate duplicated records, and improve substandard data representations. Data cleansing enforces the data standardization rules needed to enable analysis and insights from data sets. The cleansing process also establishes hierarchies and makes data customizable to fit an organization’s unique data requirements.
  • Data Monitoring: A process that ensures an organization’s Data Quality is developed, used, and maintained across the organization. It normally uses automation to monitor the quality of data. Typically, an organization develops its own key performance indicators (KPIs) and Data Quality metrics; the monitoring process measures these metrics and evaluates them against a configured Data Quality baseline. Most Data Quality monitoring systems are designed to alert data administrators when quality thresholds are not met.
  • Data Profiling: The process of data profiling establishes trends and helps discover inconsistencies within the data. It combines the monitoring and cleansing of data. Data profiling is used for:
    • Creating data relationships
    • Verifying available data against descriptions
    • Comparing the available data to a standard statistical baseline
  • Data Parsing: This tool is used to discover whether data conforms to recognizable patterns. Pattern-based parsing supports automated recognition of elements such as a telephone number’s area code or the parts of a human name.
  • Data Matching: Data matching reduces duplication and can improve data accuracy. It analyzes all records coming from a single data source for duplication, identifying both exact and approximate matches, and allows duplicate data to be removed manually.
  • Data Standardization: The transformation of data from a variety of sources and different formats into a uniform and consistent format. It repairs such things as inconsistent capitalization, acronyms, punctuation, and values located in the wrong fields. Data standardization helps ensure the stored data uses the same, consistent format.
  • Data Enrichment: The process of supplementing missing or incomplete data.

Data enrichment is done by combining data from another source. This is commonly done during data migrations, when customer information has become fragmented. The data taken from one system is used to supplement data from another.
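To make these tool functions a little more concrete, the short Python sketch below (using the pandas library) illustrates standardization, matching/deduplication, parsing, and a simple enrichment merge. The sample records, column names, and phone-number pattern are hypothetical, and real Data Quality tools are considerably more sophisticated; this is only a minimal sketch under those assumptions.

```python
import re
import pandas as pd

# Hypothetical customer records from two systems (assumed for illustration).
crm = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "name": ["alice SMITH", "Bob Jones", "Bob Jones", "Carol Lee"],
    "phone": ["(212) 555-0100", "415-555-0101", "415-555-0101", "bad-number"],
})
billing = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
})

# Data standardization: apply one consistent format to names.
crm["name"] = crm["name"].str.strip().str.title()

# Data matching / cleansing: flag and drop exact duplicate records.
duplicates = crm.duplicated()
crm = crm[~duplicates].copy()

# Data parsing: check that phone numbers match a recognizable pattern
# and pull out the area code when they do.
phone_pattern = re.compile(r"\(?(\d{3})\)?[ -]?\d{3}-\d{4}")

def area_code(phone: str):
    match = phone_pattern.fullmatch(phone)
    return match.group(1) if match else None

crm["area_code"] = crm["phone"].apply(area_code)

# Data enrichment: supplement the CRM records with email addresses
# taken from the billing system.
enriched = crm.merge(billing, on="customer_id", how="left")
print(enriched)
```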

What Are Data Quality Metrics?

Data Quality metrics have become very important for measuring and assessing the quality of an organization’s data. Using Data Quality metrics requires an understanding of the data, how it is processed, and the ways its quality can be measured. In many cases, data dimensions are measured, but other methods are also used. The different types of Data Quality metrics, a few of which are computed in the sketch that follows the list, are:

  • Data Accuracy: A measure of how closely the data reflects the real-world values it describes.
  • Ratio of Data to Errors: Keeps a tally of known errors in a data set and compares them to the size of the data set.
  • Data Completeness: Data is complete when it fulfills the expectations of an organization. It indicates when there is enough data to draw meaningful conclusions.
  • Number of Empty Values: This is a measure of the number of times an empty field exists in a data set. These empty fields often indicate information that has been placed in the wrong field, or is completely missing.
  • Data Consistency: Requires that data values taken from multiple sources do not conflict with each other. It should be noted that data consistency does not necessarily mean the data is correct.
  • Data Time-to-Value: This measures the time it takes to gain useful insights from data.
  • Data Integrity: Refers to testing data to assure its compliance with the data procedures of an organization. Data integrity shows there are no unintended errors, and uses the appropriate data types.
  • Data Transformation Error Rate: This measures how often data transformation operations fail.
  • Timeliness: Tracks how often data is not ready for users when they need it.
  • Data Storage Costs: When data is being stored without being used, it is unlikely to be quality data. If data storage costs decline while data operations remain the same or grow, it indicates the quality of the data may be improving.
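The short sketch below shows how a few of these metrics might be computed over a small, hypothetical pandas DataFrame. The column names, the sample values, and the rule that a negative amount counts as a known error are assumptions made for illustration, not standard definitions.

```python
import pandas as pd

# Hypothetical order data (column names and error rules are assumed).
orders = pd.DataFrame({
    "order_id": [100, 101, 102, 103],
    "amount": [25.0, -5.0, None, 40.0],   # a negative amount is treated as a known error
    "ship_date": ["2024-01-05", None, "2024-01-07", "not a date"],
})

total_records = len(orders)

# Ratio of data to errors: known bad values compared to the size of the data set.
known_errors = (orders["amount"] < 0).sum()
error_ratio = known_errors / total_records

# Number of empty values: count of missing fields across the data set.
empty_values = int(orders.isna().sum().sum())

# Data transformation error rate: how often a transformation (here, parsing
# ship_date into a real date) fails on values that were actually present.
parsed_dates = pd.to_datetime(orders["ship_date"], errors="coerce")
transform_failures = int(parsed_dates.isna().sum() - orders["ship_date"].isna().sum())
transform_error_rate = transform_failures / total_records

print(f"error ratio:          {error_ratio:.2f}")
print(f"empty values:         {empty_values}")
print(f"transform error rate: {transform_error_rate:.2f}")
```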

What Is Data Quality Control?

Data Quality control is about controlling how data is used. The process is typically performed both before and after Data Quality assurance (the discovery of data inconsistencies and their correction).

Prior to the Data Quality assurance process, inputs are restricted and screened. After the quality assurance process, statistics are gathered from the following areas to influence the quality control process:

  • Accuracy
  • Incompleteness
  • Severity of inconsistency
  • Precision
  • Missing/Unknown

Information from the quality assurance process is used by the Data Quality control process to decide which data to use. For example, if the quality control process discovers too many errors, it blocks the use of the data rather than allow a disruption to take place.
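One way to picture this gate is a small check that compares statistics reported by the quality assurance step against configured thresholds and blocks the data when they are exceeded. The threshold values and field names below are hypothetical; a real quality control process would be considerably richer.

```python
# Hypothetical quality-control gate: the statistics dictionary is assumed to
# come from an upstream Data Quality assurance step.
QC_THRESHOLDS = {
    "error_ratio": 0.05,    # block if more than 5% of records have known errors
    "missing_ratio": 0.10,  # block if more than 10% of fields are missing/unknown
}

def passes_quality_control(stats: dict) -> bool:
    """Return True if the data set may be used, False if it should be blocked."""
    for metric, limit in QC_THRESHOLDS.items():
        if stats.get(metric, 0.0) > limit:
            return False
    return True

# Example: too many errors were found, so use of the data is blocked.
qa_stats = {"error_ratio": 0.12, "missing_ratio": 0.03}
if not passes_quality_control(qa_stats):
    print("Data blocked: quality thresholds not met.")
```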

What Are Data Quality Dimensions?

Data Quality dimensions offer ways of measuring the quality of the data an organization uses. Using multiple dimensions can show the level of an organization’s Data Quality. The aggregated scores taken from multiple dimensions provide a reasonable representation of the data’s quality and suggest its fitness for use.

The dimensions measured should be those specific to the project’s needs. The results can define what is considered an acceptable level (or score), in turn building more trust in the data. There are six dimensions of Data Quality that are commonly used:

  • Data Completeness: This dimension can cover a variety of situations. For example, customer data is complete when it contains the minimum amount of information needed for a productive customer interaction; an order form lacking a delivery estimate, by contrast, would not qualify as complete. Completeness measures whether the data shown is sufficient to support a satisfactory interaction or transaction.
  • Data Accuracy: When data presents a realistic model of the real world (or portions of it) and meets expectations, the data can be considered accurate. The closer to “the truth” the data is, the greater the data accuracy. An accurate phone number means that a person is reachable. Accuracy is especially critical for the more regulated industries, such as finance and healthcare. Measuring data accuracy requires verifying the data against authentic sources, such as state birth records, or by contacting the person or organization in question.
  • Data Consistency: This dimension focuses on whether the same information that is being stored in multiple instances is consistent. It is displayed as the percentage of data with matching information that is stored in various locations. Data consistency ensures that analytics correctly capture and leverage the value of data.

Data consistency can be difficult to assess, as it requires planned research across multiple data storage locations.

  • Data Validity: This measurement determines whether the values shown meet certain informational requirements. For instance, a ZIP code is valid if it contains the correct numbers for the region. Business rules provide a method for assessing the validity of data (a brief sketch follows this list).
  • Data Uniqueness: This dimension determines whether a single record exists within storage, or whether there are multiple versions of the same information. Multiple copies can cause problems, because some copies may not have received updates, or may simply be wrong. Uniqueness ensures duplication is avoided.
  • Data Integrity: As data travels across different systems and is transformed, it can become distorted. Integrity indicates that the information and core attributes have been maintained. It ensures that data can be traced back to its original source.
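As a rough illustration of the validity dimension, the sketch below applies a simple business rule (a valid US ZIP code is exactly five digits, an assumption made only for this example) to a hypothetical list of values and reports a validity score. Scoring other dimensions follows the same pattern of a rule plus a measurement.

```python
import re

# Hypothetical customer ZIP codes (values are assumed for illustration).
zip_codes = ["10001", "94105", "ABCDE", "9410", "60601"]

# Business rule: a valid US ZIP code is exactly five digits.
ZIP_RULE = re.compile(r"^\d{5}$")

valid_flags = [bool(ZIP_RULE.match(z)) for z in zip_codes]
validity_score = 100.0 * sum(valid_flags) / len(zip_codes)

print(f"Data validity: {validity_score:.0f}% of ZIP codes satisfy the rule")
# -> Data validity: 60% of ZIP codes satisfy the rule
```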
