Data Quality Dimensions
Data Quality dimensions are useful concepts for improving the quality of data assets. Although Data Quality dimensions have been promoted for many years, descriptions of how to actually use them have often been somewhat vague.
Data that is considered to be of high quality is consistent and unambiguous. Poor Data Quality produces inconsistent and ambiguous data: data from different sources may show different addresses, conflicting preferences, and so on. Poor Data Quality can result from merging databases or from combining new information with old information instead of replacing it.
Data Quality dimensions are analogous to the way width, length, and height are used to express a physical object’s size. They provide a common scale for measuring Data Quality, so that one dataset can be compared with other data measured against the same scale. High Data Quality ensures an organization’s data can be processed and analyzed easily for any type of project.
When the data being used is of high quality, it can support AI projects, business intelligence, and a variety of analytics projects. If the data contains errors or inconsistent information, the results of any such project cannot be trusted. Data Quality itself can be measured using Data Quality dimensions.
The concept of Data Quality dimensions was first written about and published in 1996 by Professors Diane Strong and Richard Wang (Beyond Accuracy: What Data Quality Means to Data Consumers). They recognized 15 dimensions. In 2020, the Data Management Association (DAMA) developed a list containing 65 dimensions and subdimensions for Data Quality, ranging from “Ability” to “Identifiability” to “Volatility.”
Data Quality dimensions can be used to measure (or predict) the quality of data. This measurement system allows data stewards to monitor Data Quality, develop minimum thresholds, and eliminate the root causes of data inconsistencies. However, there is currently no established standard for these measurements; each data steward has the option of developing their own measurement system. The process involves taking samples of the organization’s data to establish baselines.
The measurements associated with these dimensions lend themselves to automation and can be implemented as rules in the Data Quality tools an organization already uses. Although published lists of Data Quality dimensions vary, most include the same six core dimensions.
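As a rough illustration of how such a measurement system might work, the sketch below uses Python with invented records, field names, and a hypothetical threshold. It samples a small dataset, computes a completeness baseline for one field, and flags the result when it falls below a minimum threshold; it is a minimal sketch of the idea, not the workflow of any particular Data Quality tool.

```python
import random

# Hypothetical membership records; None marks a missing value.
records = [
    {"name": "A. Smith", "address": "12 High St", "email": "a@example.com"},
    {"name": "B. Jones", "address": None, "email": "b@example.com"},
    {"name": "C. Brown", "address": "9 Elm Rd", "email": None},
    {"name": "D. Green", "address": "3 Mill Way", "email": "d@example.com"},
]

def completeness(sample, field):
    """Share of records in the sample that have a value for `field`."""
    filled = sum(1 for r in sample if r.get(field) is not None)
    return filled / len(sample)

# Establish a baseline from a random sample of the data
# (with real data, the sample would be much smaller than the full dataset).
sample = random.sample(records, k=len(records))
baseline = completeness(sample, "address")

# A minimum threshold a data steward might set for this field (an assumption).
MIN_ADDRESS_COMPLETENESS = 0.95

print(f"Address completeness baseline: {baseline:.0%}")
if baseline < MIN_ADDRESS_COMPLETENESS:
    print("Below threshold - investigate the root cause of the missing addresses.")
```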
The Six Most Commonly Used Data Quality Dimensions
The six core dimensions are:
- Accuracy: This dimension measures how well data reflects the real-world objects or events it is meant to model. Accuracy is often measured by comparing the data with sources known to be correct; ideally it is established through primary research, but third-party references are often used for comparison. Consider a European school accepting applications for the next semester. The application form expects the European date format (day/month/year; for example, 30/09/2021). An American parent, however, might fill out the form using the American date format (09/30/2021). The American date stored in the database would be confusing to European staff and should be corrected.
- Completeness: All required records and values should be available with no missing information. With completeness, the stored data is compared against the goal of being 100% complete. Completeness does not measure accuracy or validity; it measures what information is missing. For example, consider the address field on a membership form: if three forms out of 100 are missing an address, the address data is 97% complete. (A measurement sketch follows this list.)
- Consistency: This dimension measures the absence of difference when two or more representations of the same data item are compared. Items of data taken from multiple sources should not (in an ideal world) conflict with one another. (It should be noted that consistent data is not necessarily complete or accurate.) Consistency is measured against the data itself, although it can also be measured against its counterpart in another dataset or database. For example, a student’s date of birth should appear with the same format and value in both the school register and the records sent from the school the student is transferring from.
- Timeliness: The data’s actual arrival time is measured against the predicted, or desired, arrival time. An example of this dimension might be a nurse who gives administration a change of address on March 1, while the information is not entered into the database until March 3. Hospital guidelines suggest the data should be entered within two days (that is, by March 2), so the entry is a day late. Timeliness measures how often this happens and can be used to get more specific information on each instance of “lateness.” (Consider what would happen if air traffic controllers received a single daily download from the radar system instead of observing air traffic in real time. Timeliness can be important.)
- Validity: This dimension measures how well data conforms to pre-defined business rules; when these rules are applied, valid data falls within the defined parameters. For instance, a company assigns each employee an ID based on their last name, date of hire, and job classification. Joanna Blake has just started and has been given an ID reading “Blak12/21JA.” The “J” stands for janitor and the “A” stands for “all areas.” However, the database shows Joanna as Blak12/21JS because of a typo (the S means nothing and invalidates her security clearance). After Joanna explains the situation to her manager, the decision is made to give her physical keys rather than turning the problem over to the IT department, which could run a validity test on the database. The validity test would correct not only Joanna’s ID but also mistakes in other employees’ IDs, making the whole company run a little more smoothly.
- Uniqueness: This is designed to avoid the same data being stored in multiple locations. When data is unique, no record exists more than once within a database. Each record can be uniquely identified, with no redundant storage. The process is based on how data items are identified. In this case, the data is measured against itself (or maybe another database), as in, “Oh, look. Joe Blow has two files, and he should only have one.” Uniqueness is also compared to the real world. Let’s say a school has 100 students. But its data shows it has 108 students. Eight files have been duplicated. Not a big deal, but some of the duplicated files might be updated, while the original files were not. That could lead to some confusion.
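To make the measurements behind these dimensions concrete, the following sketch scores completeness, validity (a date-format rule), and uniqueness over a few made-up student records. It is written in Python; the field names, date format, and records are assumptions chosen for illustration, not a standard formula from DAMA or any vendor.

```python
from datetime import datetime

# Hypothetical student records; fields and values are invented for illustration.
students = [
    {"id": "S001", "name": "Ana", "dob": "31/05/2007", "address": "4 Oak Ln"},
    {"id": "S002", "name": "Ben", "dob": "2007-05-31", "address": None},       # wrong date format, missing address
    {"id": "S001", "name": "Ana", "dob": "31/05/2007", "address": "4 Oak Ln"},  # duplicate record
]

def completeness(records, field):
    """Percentage of records with a value for `field` (does not check accuracy)."""
    return sum(1 for r in records if r.get(field)) / len(records)

def validity(records, field, fmt="%d/%m/%Y"):
    """Percentage of records whose `field` parses with the expected date format."""
    def ok(value):
        try:
            datetime.strptime(value or "", fmt)
            return True
        except ValueError:
            return False
    return sum(1 for r in records if ok(r.get(field))) / len(records)

def duplicate_ids(records, key="id"):
    """IDs that appear more than once - candidates for de-duplication."""
    seen, dupes = set(), set()
    for r in records:
        if r[key] in seen:
            dupes.add(r[key])
        else:
            seen.add(r[key])
    return dupes

print(f"Address completeness: {completeness(students, 'address'):.0%}")
print(f"Date-of-birth validity: {validity(students, 'dob'):.0%}")
print(f"Duplicate IDs: {duplicate_ids(students)}")
```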
While all six dimensions are generally considered important, organizations may determine that some should be emphasized more than others, particularly in certain industries. (Or they might need one of the 65 dimensions and subdimensions created by DAMA.) For example, the financial industry places a higher value on validity, while the pharmaceutical industry prioritizes accuracy.
Complications
Many organizations do not communicate or define their data expectations when receiving data from other sources. Few provide clear, measurable expectations about the formatting or condition of data before it is sent to them. Without communicating clear expectations, it is not possible to measure the quality of the data as it is received.
When an organization does define its requirements, it usually does so in the context of a specific project, with a focus on the kind of data needed and its format. As a result, data requirements are often focused on source-to-target mapping, modeling, and implementing business intelligence tools. Using the same data for different purposes can also cause problems, because each “purpose” may carry different expectations. In some situations, data items from different sources may simply conflict with one another.
Data Quality Tools
Data Quality can be examined by humans performing the review manually, but this is slow and tedious, with a strong possibility of human error. Because many Data Quality dimensions can be expressed in a formulaic way, software tools can be used to automate the assessment of Data Quality.
Each dimension is built on underlying concepts, and these concepts (and their associated metrics) allow for the development of formulas that computers can evaluate. Gartner has provided a list of Data Quality tools that might be useful.
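As one example of how a dimension’s underlying concept can be turned into a formula a tool could evaluate automatically, the sketch below computes a timeliness score as the share of records entered within an assumed two-day deadline. The deadline and the date pairs are hypothetical.

```python
from datetime import date, timedelta

# Assumed service level: changes must be entered within two days of being reported.
ENTRY_DEADLINE = timedelta(days=2)

# Hypothetical (reported, entered) date pairs for address-change records.
changes = [
    (date(2021, 3, 1), date(2021, 3, 2)),    # on time
    (date(2021, 3, 1), date(2021, 3, 5)),    # late
    (date(2021, 3, 10), date(2021, 3, 11)),  # on time
]

# Timeliness expressed as a formula: fraction of records entered within the deadline.
on_time = sum(1 for reported, entered in changes if entered - reported <= ENTRY_DEADLINE)
timeliness = on_time / len(changes)

print(f"Timeliness: {timeliness:.0%} of changes entered within {ENTRY_DEADLINE.days} days")
```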
Data Quality Issues
Data Quality issues can waste time and reduce productivity. They can also damage customer satisfaction, or even result in penalties for regulatory noncompliance.
Poor Data Quality can also conceal opportunities from a business, or leave gaps in its understanding of its customer base. Nissan Europe, for example, was using customer data that was unreliable and spread out across a variety of disconnected systems, making it difficult to generate personalized advertising. By improving Data Quality, Nissan Europe now has a better understanding of its current and prospective customers, helping it to improve customer communications.
Poor Data Quality wastes time and energy, and manually correcting a database’s errors can be remarkably time consuming.