Data quality dimensions in IBM Watson Knowledge Catalog
Understand the standard data quality dimensions used by IBM Cloud Pak for Data and IBM Infosphere Information Server.
Image by saulhm from Pixabay (https://pixabay.com/users/saulhm-31267/)
In a previous article, I explained in detail how IBM Cloud Pak for Data and IBM Infosphere Information Analyzer compute a unified data quality score for each analyzed dataset:
In short, the data quality score of a data set is computed by applying algorithms which look for different types of data quality issues. A data quality issue is identified whenever the data do not fulfil a given expectation. Such issues can be reported for individual cells of the data set, for complete rows, for columns, or for the data set as a whole. The type of data quality issue is what is shown in the user interface as the “Data Quality Dimension”.
Data Quality Dimensions results in Watson Knowledge Catalog
IBM ships a list of standard data quality dimensions with its products. These out-of-the-box dimensions are automatically applied when data sets are analyzed in Watson Knowledge Catalog or in Information Server. This can happen either in a data quality project looking in depth at the data quality of selected data assets, or as part of a data discovery and ingestion process — see my previous article “Unlock Your Data With IBM Watson Knowledge Catalog” for how such a discovery process works.
In this article I am going to explain each of these standard data quality dimensions in detail: what they measure and how expensive they are to execute.
The cost of a data quality dimension
Not all data quality dimensions are equally expensive. Some kinds of data quality problems are rather easy to identify on a row-by-row basis; others may require more work and multiple passes over the data, and can therefore have a non-negligible impact on the duration of the analysis.
The cost — in terms of performance — of a data quality dimension depends primarily on the number of passes over the data that are necessary to identify the data quality problems it searches for. Some data quality dimensions do not need to read the data at all, because they can reuse statistics which are precomputed by data profiling before the data quality analysis happens. Most data quality dimensions require a single pass over the data. Some need to read the whole data set once to compute statistics and then make a second pass over all the records to identify the problems.
The scope of a data quality dimension
As explained in my previous article, data quality problems can be reported for an individual value, a complete row, a complete column or the data set as a whole. This is what I will call in this article the scope of the data quality dimension.
The finer the scope of a dimension, the more accurately the data quality score can be computed, because double counting of problems can be avoided.
To illustrate this, imagine that in the same column you have 10% data class violations and 10% format violations. If these two data quality dimensions had a scope at column level — meaning the problems were reported for the column as a whole — it would be impossible to determine how many of the values with data class violations are the same values as the ones with format violations. The data quality score of the column would then be computed as 90% * 90% = 81%.
In reality, these two dimensions can identify the problem at value level. Most likely, when a value doesn't have the expected data class, it will also not have the expected format, so many of the values with data class violations will be the same values as the ones with format violations.
In that case, since a bad value is a bad value no matter how many problems it has, the data quality score would be computed as 90%, indicating that 10% of the values are problematic.
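To make the difference concrete, here is a minimal numeric sketch — an illustration, not the product's actual scoring code — of how a column-level scope double counts a value that violates two dimensions, while a value-level scope counts it only once:

```python
# Minimal sketch (not the product's actual scoring code): compare a
# column-level score, which multiplies the pass rates of each dimension,
# with a value-level score, which counts each bad value only once.
n = 10                    # number of values in the column
class_violations = {9}    # row indices failing the data class check
format_violations = {9}   # row indices failing the format check (same row here)

# Column-level scope: each dimension only knows "10% of the column is bad",
# so the two pass rates multiply and the same bad value is penalized twice.
column_level_score = (1 - len(class_violations) / n) * (1 - len(format_violations) / n)

# Value-level scope: a bad value is counted once, however many issues it has.
value_level_score = 1 - len(class_violations | format_violations) / n

print(round(column_level_score, 2))  # 0.81
print(round(value_level_score, 2))   # 0.9
```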
This example shows that data quality dimensions with value scope are the most accurate. However, for some problems it is not feasible, or too expensive, to identify the exact values having the problem. For this reason, some data quality dimensions only report their problem at column level.
Let’s now have a look at each individual data quality dimension.
Dimensions without any pass over the data
Duplicated values — scope: column
This dimension looks for duplicate values in columns which are expected to contain only unique values — for instance primary key columns. Searching for duplicates during the data quality analysis itself would be too expensive, because it would require sorting the whole column, so this dimension reuses the statistics which are precomputed during data profiling (column analysis). This has the advantage that the dimension doesn't need to look at the data again, but it can only report the frequency of the problem for the column as a whole and cannot tell which rows exactly are affected. Because of this, it cannot be excluded that multiple problems affecting the same values are double counted.
In order to find out the exact rows containing a duplicate, you can use a data rule.
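Outside the product, the same check can be sketched quickly — for example with pandas, assuming a hypothetical key column named customer_id — to see which rows actually carry a duplicated key:

```python
# A minimal sketch (outside the product) of locating the exact rows that
# carry a duplicated key, assuming a hypothetical column "customer_id".
import pandas as pd

df = pd.DataFrame({"customer_id": [1, 2, 2, 3, 4, 4, 4],
                   "name": ["A", "B", "C", "D", "E", "F", "G"]})

# keep=False marks every occurrence of a duplicated key, not just the repeats.
duplicate_rows = df[df["customer_id"].duplicated(keep=False)]
print(duplicate_rows)
```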
Dimensions requiring a single pass over the data
Data class violations — scope: value
This data quality dimension reports any value which doesn’t match the inferred or selected data class of its column.
But you need to know that not all data classes can be used to identify domain violations at value level. Like data quality issues, data classes have a scope indicating whether they are applicable at value level or only at column level.
Data classes working at value level — for instance all data classes based on a regular expression or a list of values — can return a clear match or no-match decision for each tested value and can therefore be used to validate the values during the data quality analysis.
Data classes working at column level — for instance a data class like person name — only compute a confidence that the column as a whole is of a given class, but cannot test any individual value.
For this reason, if the data class assigned to a column has column scope, the data quality analysis will ignore the data class and no data class violations will be searched for on that particular column.
You can check the scope of a data class by looking at its definition in Watson Knowledge Catalog:
Looking at the scope of a data class in Watson Knowledge Catalog
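To illustrate why value-level classes can be checked directly, here is a small sketch — with an assumed regular expression standing in for a zip-code-like data class, not the product's own definition — showing how each value gets a clear pass/fail decision:

```python
# A minimal sketch (assumed, not the product code) of a value-level data class:
# a regex-based class gives a yes/no answer per value, so violations can be
# attributed to individual values.
import re

ZIP_LIKE = re.compile(r"\d{5}(-\d{4})?")   # illustrative regex-based data class

values = ["10115", "94105-1234", "ABC12", "30301"]
violations = [v for v in values if not ZIP_LIKE.fullmatch(v)]
print(violations)  # ['ABC12'] -> reported as data class violations
```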
Data type violations — scope: value
This dimension counts the values of a column which are incompatible with its inferred or selected data type:
During data profiling, the column analysis determines the inferred data type of each column. The inferred data type is independent of the data type defined in the source and represents the most specific data type which could contain all values of the column.
To illustrate this notion, let's imagine a column defined in a table of a SQL database with the data type VARCHAR(255). This is the defined data type on the source, but in reality the values of the column may contain dates encoded as strings. In that case the inferred data type may be DATE or TIMESTAMP, depending on the format of the values.
In another example, the same column may contain only strings, the longest of which is 40 characters long. In that case the inferred type would be VARCHAR(40).
If the column contains a mix of different types — for instance dates and strings and numbers — then the inferred data type will be the most specific type which can represent all these values — in this example it would be VARCHAR(n), n being the length of the longest value.
The inferred data type can be overridden by the user to indicate which data type should be expected. If the selected data type is set to DATE, for instance, then any value which is not a native date value, or is not a string or number that can be parsed as a date, will be considered a violation.
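The general idea of the inference can be sketched roughly as follows — a simplified assumption, not the actual inference logic — trying the most specific types first and falling back to a string type sized to the longest value:

```python
# A rough sketch (assumption, not the actual inference logic) of deriving an
# inferred data type from the values of a column.
from datetime import datetime

def infer_type(values):
    def all_match(parse):
        try:
            for v in values:
                parse(v)
            return True
        except (ValueError, TypeError):
            return False

    if all_match(int):
        return "BIGINT"
    if all_match(float):
        return "DOUBLE"
    if all_match(lambda v: datetime.strptime(v, "%Y-%m-%d")):
        return "DATE"
    # Fall back to the most generic type, sized to the longest value.
    return f"VARCHAR({max(len(str(v)) for v in values)})"

print(infer_type(["2021-01-01", "2021-02-15"]))   # DATE
print(infer_type(["2021-01-01", "hello world"]))  # VARCHAR(11)
```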
Format violations — scope: value
This dimension counts the values of a column with a format that has been marked as invalid for this column by a user.
This means that this dimension will only find issues for columns where identified formats have been declared invalid by a user. To do so, you need to open the analysis results of the column in a data quality project and go to the “Formats” tab to see the list of all identified formats. Then you can enter the edit mode and mark the unwanted formats as invalid.
Edit the list of valid formats for a column
By doing this you add a constraint on the column indicating which formats are valid or not. This invalidates the current data quality score, which, when recomputed, will reflect the findings based on the new constraint.
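To give an idea of how such a check could work, here is a simplified sketch — assuming, for illustration only, a format notation where letters map to 'A' and digits to '9' — that flags values whose format appears in a user-maintained list of invalid formats:

```python
# A simplified sketch of format checking (assumed symbols: 'A' for letters,
# '9' for digits, other characters kept as-is).
def value_format(value: str) -> str:
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

values = ["AB-1234", "CD-5678", "1234-EF"]
invalid_formats = {"9999-AA"}   # formats a user marked as invalid

violations = [v for v in values if value_format(v) in invalid_formats]
print(violations)  # ['1234-EF']
```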
Inconsistent capitalization — Scope: value
This dimension first checks whether the majority of the values of a column follow a convention regarding the usage of upper and lower case in strings. The capitalization rule of a column can be one of the following:
- Upper case: the values are written in upper case only. For example: “THIS IS A VALUE”
- Lower case: the values are written in lower case only. For example: “this is a value”
- Name case: the first letter of each word within the value is written in upper case, the rest of the word is in lower case. For example: “This Is A Value”
- Sentence case: the first letter of each value is in upper case, the rest is in lower case. For example: “This is a value”
- Any case: upper case and lower case are mixed and no rule can be recognized. For example: “This IS A value”
If the majority of the values in a column use one of these capitalization rules, the values which do not follow the same capitalization rule are considered as data quality issues.
If the data don’t show any strong pattern in the capitalization rule being used, then the analysis will assume that the capitalization is not constrained and will not search for any issue of that type.
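A rough sketch of the idea — an assumption about the logic, using Python's built-in string predicates rather than the product's actual classifier — could look like this:

```python
# A minimal sketch (assumption, not the product logic) of classifying the
# capitalization of each value and flagging values that deviate from the
# majority rule of the column.
from collections import Counter

def cap_rule(value: str) -> str:
    if value.isupper():
        return "upper"
    if value.islower():
        return "lower"
    if value == value.title():
        return "name"
    if value == value.capitalize():
        return "sentence"
    return "any"

values = ["This Is A Value", "Another Good Value", "this one IS odd"]
rules = [cap_rule(v) for v in values]
majority, count = Counter(rules).most_common(1)[0]

if count / len(values) > 0.5:            # only enforce a clearly dominant rule
    suspects = [v for v, r in zip(values, rules) if r != majority]
    print(suspects)                      # ['this one IS odd']
```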
Inconsistent representation of missing values — scope: value
This dimension reports a problem if both null values and empty values (or values containing only spaces) are found within the same column.
Missing values — scope: value
This dimension searches for values which are either null or empty in columns which are not expected to contain missing values.
You can set in the “Column properties” tab of the column analysis results the flag indicating whether missing values are acceptable or not for a given column.
Specify if missing values are allowed or not in a column
Note that, contrary to what the UI suggests, this is not only about NULL values but about missing values in general, which also include empty strings and values containing only spaces.
If no constraint has been specified, the analysis will try to infer the nullability flag of the column by looking at the number of missing values it contains. If the number of missing values (null and empty values) is below the nullability threshold, the column is considered non nullable and any missing or empty value will be reported as a data quality issue. If the column contains more missing values than the threshold allows, the analysis will assume that missing values are expected in this column and won't report an issue by default.
The default nullability threshold is 1%, meaning that only columns containing less than 1% missing values are considered as non nullable. You can set this threshold in the data quality settings of the project.
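The inference can be sketched as follows — a small illustration of the threshold logic described above, with the missing-value test written as an assumption rather than the actual implementation:

```python
# A small sketch (assumed logic) of the nullability inference: if the share of
# missing values is below the threshold, the column is treated as non nullable
# and every missing value becomes a data quality issue.
NULLABILITY_THRESHOLD = 0.01   # default of 1%

def is_missing(v) -> bool:
    return v is None or str(v).strip() == ""

values = ["a", "b", None, "d"] + ["x"] * 200
missing_ratio = sum(is_missing(v) for v in values) / len(values)

if missing_ratio < NULLABILITY_THRESHOLD:
    issues = [i for i, v in enumerate(values) if is_missing(v)]
    print(issues)   # row indices reported as missing-value issues
else:
    print("column considered nullable, no missing-value issues reported")
```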
Values out of range — scope: value
This dimension is only used if a range constraint specifying the minimum and/or maximum allowed value for a column has been specified.
You can define such a constraint by opening the profiling results of a column and entering the edit mode in the “Column Properties” tab as shown in the next screenshot. You can set either a minimum value or a maximum or both. Any value outside the specified range will be reported as having a data quality issue.
Specifying a minimum and/or maximum allowed value for a column
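The check itself is straightforward; a trivial sketch, assuming a numeric column such as an age with user-defined bounds:

```python
# A trivial sketch of the range constraint check (assuming a numeric column
# with a user-defined minimum and maximum, e.g. an "age" column).
MIN_ALLOWED, MAX_ALLOWED = 0, 120

values = [34, 27, -1, 56, 300]
out_of_range = [v for v in values if not (MIN_ALLOWED <= v <= MAX_ALLOWED)]
print(out_of_range)  # [-1, 300] -> reported as data quality issues
```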
Rule violations — scope: value
Data rules and quality rules are the easiest way to define custom dimensions.
The data quality analysis will consider all data rules or quality rules which are bound to columns of the analyzed data set and do not require any join operation with other data sets. If a rule matches these criteria, its logic will be evaluated on the analyzed rows during the data quality analysis.
Whenever a row of the data set does not pass the rule, a data quality issue will be reported for each column bound to the rule for the given row. Each rule is reported under its own data quality dimension.
Note that it is not necessary to run the data rules themselves to get their logic evaluated as part of the data quality analysis. The logic of the rule is automatically included as a constraint on the data set.
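Conceptually, the evaluation works like the following sketch — a hypothetical single-table rule “end_date must not be before start_date”, written in Python rather than the actual rule syntax — where every failing row produces an issue for each bound column:

```python
# A minimal sketch (not the rule engine) of evaluating a single-table rule
# row by row and reporting an issue for each bound column of failing rows.
rows = [
    {"start_date": "2021-01-01", "end_date": "2021-02-01"},
    {"start_date": "2021-03-01", "end_date": "2021-02-15"},
]

rule = lambda r: r["end_date"] >= r["start_date"]   # hypothetical rule logic
bound_columns = ["start_date", "end_date"]          # columns bound to the rule

for i, row in enumerate(rows):
    if not rule(row):
        for col in bound_columns:
            print(f"row {i}, column {col}: violation of rule 'end after start'")
```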
Dimensions requiring two passes over the data
Suspect values — scope: value
This data quality dimension is less easy to explain, but at the same time it is quite a powerful dimension.
Basically this dimension searches for values which do not look like they would belong to the same domain as the other values of the same column. In some sense it is similar to searching for outliers, but the notion of outlier is extended to not only the value itself, but also its format and properties.
To best explain how this works, imagine that you see a printed list of phone numbers. Without looking at the individual values, chances are good that you will recognize at first glance that the values are phone numbers, because they just look like phone numbers.
The human brain doesn't evaluate each individual value; it extracts some patterns regarding the format, the fact that the values are mostly made of digits, their length, etc., and matches that against what we have learned in the past about how phone numbers typically look.
At the same time, if in the middle of a long list of phone numbers, one of the values is not a phone number but maybe an email address, the human brain would immediately see that one of the values is obviously wrong, without having to read any individual values.
The suspect values dimension tries to simulate what the brain is doing in this simple example. To do so, it first computes for a column a domain fingerprint which contains identified patterns which seem to characterize the domain of the values. Once the fingerprint is computed, it makes a second pass over the data to try to identify which individual value seems to deviate a lot from these identified patterns.
The properties of the values which are considered in that decision are varied and depend on the nature of the data. They can include the format of the values, the most frequent recurring tokens within the values, diverse statistics regarding the length, the number of tokens, the type of characters found in the values, the bigram distribution of the values, etc.
In the following example, the highlighted value in the first column would be detected as suspect, because its bigram distribution and number of tokens are not consistent with the other values. In the second column, the value would be marked as suspect because it is a number in a column which otherwise contains strings. In the third column, the highlighted value would be marked as suspect because it is a numeric outlier.
Example of suspect values
This dimension can be useful to detect non obvious data quality problems that a human would be able to see if pointed to them, but that may be hidden in the middle of many rows, so that they may be hard to recognize.
The drawback of this dimension is that it is more expensive because it requires one pass to learn what should be considered as expected and one pass to detect the suspect values.
What you should also understand is that this dimension doesn't provide a deterministic rule for when a value is considered suspect and when not. It is based on probabilities and on comparing values with what the majority of values look like. This may of course produce false positives or miss wrong values, just as the human brain may make mistakes.
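To illustrate the two-pass idea — and only the idea; the real fingerprint is far richer than this — here is a toy sketch that learns a dominant format and typical length in a first pass and flags deviating values in a second pass:

```python
# A toy two-pass sketch (an assumption, far simpler than the real fingerprint):
# pass 1 learns simple features of the column, pass 2 flags values that
# deviate strongly from them.
from collections import Counter

def value_format(value: str) -> str:
    return "".join("A" if c.isalpha() else "9" if c.isdigit() else c for c in value)

values = ["555-0101", "555-0199", "555-0123", "john@example.com", "555-0144"]

# Pass 1: build a tiny "fingerprint" of the column.
dominant_format, _ = Counter(value_format(v) for v in values).most_common(1)[0]
avg_len = sum(len(v) for v in values) / len(values)

# Pass 2: flag values deviating from the fingerprint.
suspects = [v for v in values
            if value_format(v) != dominant_format or abs(len(v) - avg_len) > 5]
print(suspects)  # ['john@example.com']
```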
Suspect values in correlated columns — scope: value
This dimension is complementary to the previous one. While the “Suspect values” dimension only looks at whether the values of a column are homogeneous within that column, this dimension looks at whether the combinations of values within a single row are plausible when taken together.
For this, it first searches in a first pass for correlations between columns. Two columns are correlated if the value of one column can be determined from the value of the other column. This could be the case if, for instance, one column contains the result of an operation combining one or more other columns. It could also be the case if one column has an inclusion dependency on another column.
As a concrete example, imagine that you have a data set with a column containing city names and another column containing state names. For any given city there is normally only one state. This means that there is a correlation between these columns. The analysis can learn correlation rules from the records of the data set and identify data quality issues where city and state don’t seem to fit together because their combination would violate a correlation rule.
In simpler words: if the majority of the rows having San Francisco as a city also have California as a state, then a row {San Francisco; Arizona} will be reported as suspect, because this combination of values breaks the correlation rules detected between the two columns when looking at all rows.
Like the previous dimension, this approach is based on statistics rather than boolean rules and may produce false positives. Computing this dimension is also more memory hungry than the other dimensions — especially on data sets with a large number of columns — because the correlation tables between each pair of columns need to be computed and kept in memory during the analysis.
On the other hand, enabling this dimension allows the detection of non-trivial problems which would not be visible when looking at one column at a time. This is a data quality dimension you may want to enable when doing a deep data quality analysis of chosen data sets, and it may be better to disable it for large data discovery jobs where it is important to keep the jobs fast.
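The city/state example can be turned into a toy sketch — an assumption about the general approach, not the actual correlation algorithm — where a first pass learns the majority state per city and a second pass flags rows contradicting it:

```python
# A toy sketch (assumed logic) of the correlated-columns check: pass 1 learns
# the majority state per city, pass 2 flags rows contradicting that majority.
from collections import Counter, defaultdict

rows = ([("San Francisco", "California")] * 9
        + [("San Francisco", "Arizona"), ("Phoenix", "Arizona"), ("Phoenix", "Arizona")])

# Pass 1: majority state observed for each city.
by_city = defaultdict(Counter)
for city, state in rows:
    by_city[city][state] += 1
majority_state = {city: c.most_common(1)[0][0] for city, c in by_city.items()}

# Pass 2: report rows breaking the learned correlation rule.
suspects = [(i, r) for i, r in enumerate(rows) if r[1] != majority_state[r[0]]]
print(suspects)  # [(9, ('San Francisco', 'Arizona'))]
```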
Summary
We have seen that not all data quality dimensions are equally expensive. Some are fairly simple and don't slow the analysis down much. Others require multiple passes over the data and may double the analysis time, but can help identify non-obvious problems. Different scenarios may call for enabling different types of dimensions.
We have also seen some details about each of the standard data quality dimensions.
Finally we have seen that custom data quality dimensions can be added by using data rules.
Information Analyzer also allows you to program and deploy your own data quality dimensions by implementing them in Java. This capability is not yet available in Cloud Pak for Data in the current version (3.5 at the time of this writing). But maybe this is an extension point which will be ported to Cloud Pak for Data in the future…
What new data quality dimension do you think should be added to the list of standard dimensions? Let me know by adding some comment to this article.