How to Improve Your Data Quality Assessment Process: Laura Sebastian-Coleman — Data Quality Pro

Dylan Jones: With a data quality assessment, most companies will define the scope of the assessment and document the requirements that data must be assessed against.

One problem I’ve seen with this approach is a lack of understanding around what the requirements should be.

How do you solve this problem?

Laura Sebastian-Coleman: Defining quality requirements for data is challenging. There are not a lot of examples out there. I think the problem is intensified by at least two things.

  1. One is that many organizations struggle with defining requirements generally.

  2. Second, with data quality requirements, people often put the cart before the horse, asking questions like “What should the threshold be here?” rather than recognizing data quality requirements as essentially expectations about what condition the data should be in. Ideally, these expectations should be formulated in measurable ways, but often stating them at all is a challenge. As with other requirements, the process of getting there is iterative.

In many ways, data quality assessment is requirements definition. We don’t always know what we expect data to look like until we begin to assess its condition. Once we start assessing, we see things in the data that enable us to articulate what we expect the data to look like. Requirements can be formulated in relation to observations made during assessment.

Based on those preliminaries, I am going to give two answers. A short one that summarizes the basic steps. And a long one that explains the steps.

The process steps are, roughly, these:

  • Establish scope – what data you will look at

  • Define your deliverables – what you expect the end product of the assessment to look like

  • Document what you know and what you don’t know about the data – these expectations are your starting point; the assessment will both test and refine them

  • Define the basic questions you will ask of each data set, data element, or rule – these should be based on the dimensions of quality that the assessment will include

  • Start observing the data and determining what makes sense and what looks questionable and why it makes sense or looks questionable

  • Use what you learn from the process to refine your approach – add or alter questions, dimensions, format of deliverable

  • Produce the deliverable – which should contain, at minimum, both observations and conclusions about what you have seen

  • Share the results, get feedback, and refine the results based on that feedback

The steps above are based on the assumption that these activities would be carried out by a team of people. If there is an existing team, then they would define scope and deliverables. If there is not a team, then the team composition would depend on how scope and deliverables are defined.

That makes it all seem tidy and simple. So on to the long description…

Assessment consists of looking at data, asking questions about it, and finding the answers to those questions in order to answer the fundamental quality questions: Does this data make sense? Is it usable? If it does not make sense, then analysis is required to understand why it is in the condition it is in.

Finding answers often entails comparing data to other ‘things’ that can tell you whether it is in the condition it needs to be in, and drawing conclusions about its condition.

The ‘things’ that serve as comparisons include:

  • common sense

  • direct input from data consumers about their expectations

  • knowledge of the processes that create the data

  • known facts about the data

  • other data that represents similar objects or events, and, over time, measurements of similar data.

In a data quality assessment, as in any project, it is important to begin with the end in mind. An assessment should have goals and defined deliverables. Goals can be general or specific.

For example, an assessment could be conducted simply to understand the condition and content of a data set.

Or one might be required in order to determine whether data from a new source can be integrated into an existing database.

In both cases, the deliverables are similar: descriptive information (observations) about the condition of the data and analysis of the implications of that condition.

The first step is to establish a workable scope.

What data will be assessed?

Scope may be defined as:

  • a single data element

  • a set of data elements

  • data related to a specific process

  • data within one data domain

  • data within one database

  • data across systems

The first time you execute an assessment it is best to limit scope, not only to ensure that you produce a usable deliverable in a reasonable amount of time, but also so you can establish a workable, repeatable process.

Once you have scope, you need to understand your own starting point by defining what you know and what you don’t know about the data and the processes it represents.

If you are assessing the condition of one data element about which you know nothing, then your starting point is that you do not know what to expect. This is the situation your question implies.

But when we think about it, it is relatively rare to know nothing about a data element or a data set.

The fact that data is captured at all should tell you something about it. The fact that someone wants it assessed means someone cares about what it contains. In most situations, we have at least some assumptions.

A key activity is making these assumptions explicit so that they can be tested. One reason the DQAF includes steps for assessing the condition of metadata and documentation about business processes is to establish fundamental assumptions about the data. Still, not having any specific expectations is a possible starting point.

The next step is to establish initial characteristics to look at and what questions to ask about those characteristics.

Again, the motivation for the assessment may provide some guidance in determining what to look at.

Assessments can be conducted at different levels:

  • column, field

  • rule

  • cross-column

  • record

  • cross-table

  • cross data set

I think it is best to start at the simplest level (the column) in order to understand the fundamental condition of data elements and then work outward toward more complex relationships.

In many cases, the work of explaining the condition of one column leads you to identify more complex relationships, such as cross-column rules.

In discussions of data quality, we often talk about dimensions of quality. These are useful categories for understanding the condition of data. Dimensions of quality provide some basic questions to ask about data at the column level that can be answered in relatively objective terms:

  • Completeness: How fully is the column populated? (i.e., for what percentage of records is a value provided?)

  • Validity: What percentage of records contain valid values, and what percentage contain invalid values?

Answers to these questions are, simply, objective observations about the condition of the data.
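
To make those observations concrete, here is a minimal sketch in Python; the records, the column name, and the set of valid values are hypothetical stand-ins for whatever your own data and reference metadata would supply.

```python
# A minimal sketch of column-level completeness and validity checks.
# The records, the column name, and the set of valid values are hypothetical.

records = [
    {"customer_id": 1, "status_code": "A"},
    {"customer_id": 2, "status_code": ""},     # not populated
    {"customer_id": 3, "status_code": "Z"},    # populated but invalid
    {"customer_id": 4, "status_code": "I"},
]

VALID_STATUS_CODES = {"A", "I", "P"}           # assumed reference list

def assess_column(rows, column, valid_values):
    """Return objective observations: completeness and validity as percentages of records."""
    total = len(rows)
    populated = [r[column] for r in rows if r[column] not in ("", None)]
    valid = [v for v in populated if v in valid_values]
    return {
        "completeness_pct": 100.0 * len(populated) / total,
        "validity_pct": 100.0 * len(valid) / total,
        "invalidity_pct": 100.0 * (len(populated) - len(valid)) / total,
    }

print(assess_column(records, "status_code", VALID_STATUS_CODES))
# {'completeness_pct': 75.0, 'validity_pct': 50.0, 'invalidity_pct': 25.0}
```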

The actual assessment comes with the follow-up questions:

  • How fully populated should the column be and why?

  • Under what conditions is the column expected to be populated?

  • Under what conditions is the column expected not to be populated?

  • Do the records with invalid values share any characteristics (age, record type, system of origin, etc.) that distinguish them from the records with valid values?

  • Are there patterns in the data that are related to different systems of origin?

Once you have set up your basic approach, then start assessing the data and work the process.

Earlier, I stated that it is best to begin with the end in mind. An important part of the process is to establish how you will document the objective observations and conclusions from them.

My team uses a spreadsheet that includes a set of questions based on dimensions of quality and observable characteristics of data (default population, validity, cardinality, distributions of values, etc.) along with a set of Yes/No questions that explicitly ask for quality assessment.

This kind of tool is helpful because it enforces a disciplined look at the data, allows the iterative process to take place, and prepares the results so that they can be shared.

So if one analyst draws a conclusion – “This data does not make sense because it is supposed to represent current customers and 25% of the birthdates are before 1/1/1900” – another analyst may be able to explain why the data is in the condition it is in: “That data was imported from another system in which birth date was not captured.”
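
A rough sketch of the kind of row such a tool might hold follows; the field names and example values are hypothetical illustrations, not the actual template.

```python
# A sketch of the kind of row such a spreadsheet (or simple table) might hold.
# Field names and example values are hypothetical illustrations.
from dataclasses import dataclass

@dataclass
class ColumnAssessment:
    data_element: str
    dimension: str        # e.g. completeness, validity, cardinality
    observation: str      # objective finding from profiling
    makes_sense: bool     # the explicit Yes/No quality judgment
    explanation: str = "" # filled in as reviewers explain the condition

row = ColumnAssessment(
    data_element="CUSTOMER.BIRTH_DATE",
    dimension="validity",
    observation="25% of birthdates are before 1/1/1900",
    makes_sense=False,
    explanation="",  # later: "Imported from a system that did not capture birth date"
)
print(row)
```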

From here, decisions can be made about whether to take actions to improve the quality of the data or whether to deem the data unusable.

Let’s take a simple example.

I got a help desk ticket that asserted the gender code field on the Provider table in our database was not populated correctly because it defaulted more than 30% of the time. The person who submitted the ticket clearly expected the field to be fully populated with valid, non-defaulted values.

The first step in assessing the data is to know what you are supposed to be looking at. Doing so involves understanding basic terminology.

In health care, providers include individual people (doctors, nurses, physical therapists, etc.) who render medical services. But the definition is not limited to individuals. Facilities (hospitals, surgery centers, etc.) where medical services are rendered and groups of doctors in a practice together are also considered providers.

Analysis of the data confirmed that, indeed, 30% of provider records had a defaulted gender code of ‘U’ – Unknown.

However, analysis also showed that these records were associated with facilities and groups and therefore were not expected to have gender code populated.

Once the assessment was conducted, we could clarify expectations for the field:

  • All records for individual providers (people; provider type = ‘I’) are expected to be populated with a valid gender code.

  • All records associated with facilities or groups (provider type = ‘F’ or ‘G’) are expected to be populated with a default value for gender code.

These assertions are data quality requirements.

When you get to the point where you can make these kinds of assertions, you are also at the point where you can measure them: either the field is populated according to the rules or it is not.
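
As a minimal sketch, the two assertions could be measured with logic like the following; the record layout and the assumed set of valid codes are illustrative, while the provider types and the ‘U’ default come from the example above.

```python
# A minimal sketch of measuring the two assertions above. The record layout
# and the set of valid codes for individuals are assumptions; the provider
# types ('I', 'F', 'G') and the 'U' default come from the example.

VALID_GENDER_CODES = {"M", "F"}      # assumed valid, non-defaulted codes
DEFAULT_GENDER_CODE = "U"            # 'U' - Unknown, used as the default

providers = [
    {"provider_id": 101, "provider_type": "I", "gender_code": "F"},
    {"provider_id": 102, "provider_type": "F", "gender_code": "U"},
    {"provider_id": 103, "provider_type": "G", "gender_code": "U"},
    {"provider_id": 104, "provider_type": "I", "gender_code": "U"},  # violates the first rule
]

def meets_requirement(rec):
    """True if gender_code is populated according to the stated rules."""
    if rec["provider_type"] == "I":
        return rec["gender_code"] in VALID_GENDER_CODES
    if rec["provider_type"] in ("F", "G"):
        return rec["gender_code"] == DEFAULT_GENDER_CODE
    return False  # unexpected provider type: flag for follow-up analysis

conforming = sum(meets_requirement(r) for r in providers)
print(f"{100.0 * conforming / len(providers):.1f}% of records meet the requirement")  # 75.0%
```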

From here it is possible to make other improvements.

For example, in this case, if the gender code applied to groups and facilities were expressed as Not Applicable, rather than Unknown, the results might be more comprehensible to a data consumer.

For more complex findings, especially those that uncover unacceptable conditions in the data, improvement projects can be launched.

To get value out of the assessment, these assertions need to be shared with other people who produce and use the data. They can confirm, refine, or reject the formulated quality expectations.

In some organizations, these expectations may be clarified and agreed to once and for all. In other organizations, they will be part of an ongoing conversation. In either case, they should be incorporated into supporting metadata, so that they do not need to be rediscovered.

Dylan Jones: For someone looking to implement their first data quality assessment, how should they include the business community?

Laura Sebastian-Coleman: I dislike the fact that in discussions of data management and data quality there is such a sense of separation between business people and technical people.

Data is so critical to non-technical functions that, for years, I have been perplexed as to why so many conversations about how to manage it are unproductive.

I think the conflicts are based in unrealistic expectations on both sides.

Technical people often profess not to know anything about the data. Further, they assume that business people know exactly how data should look and what they want to do with it.

Business people assume technical people understand at least the basics about the data they are charged with managing. They do not expect expertise, but they do expect at least some knowledge. Business people also assume that managing data means having consistent information about what data exists in any given system. Unfortunately, this information is often not gathered consistently.

Since I have finally accepted that this gulf exists, I look for ways to bridge it.

Coming from the technical side, I try to do my homework and help my business customers help my team do our job of measuring and monitoring the quality of data.

In the previous question, I stated that one of the inputs for assessments is direct input from data consumers about their expectations.

When and how to engage the business community for that input depends a lot on the specific organization, especially on how teams work across disciplines. If there is good documentation and relationships are productive, business people can and should be engaged directly, from the start, as part of the team that conducts the assessment.

That said, some aspects of data assessment require specific technical skills and, in some cases, access to data and tools needs to be limited. So if it is not possible for business people to conduct the assessment directly, they should be engaged to answer questions about business processes and to review both the observations and the results of the assessment.

If IT folks are solely responsible for the nuts and bolts of the assessment, it is very important to plan for how to communicate the results to business people. The assessment won’t produce value until the results are understood by people who produce or use the data.

Results can be used to identify which data should be monitored and how that monitoring can take place. They can then be the basis for process changes, technical changes, or other improvements.

Or the results may serve as valuable metadata that explains the condition of the data to data consumers. Business people will be the consumers of these applications of the results, so they need to understand and influence how all the pieces fit together.

Dylan Jones: One of the big challenges with measuring dimensions such as consistency is tracking data across very complex information chains that often span multiple systems and can take many days or even weeks to complete. You’re not really looking at historical data; it’s more of a real-time, in-flight assessment. Do you have experience of doing this type of assessment, and what tips can you share?

Laura Sebastian-Coleman: Your question makes me think of several topics at once, all of which have a bearing on the quality of data. These include:

  • The different facets of the dimension of consistency

  • The relation between data assessment and other activities, such as information architecture and data modeling, related to developing data storage systems

  • The necessity of understanding data lineage – how data moves from one system to another – and being able to compare data across systems

The initial focus of the DQAF was on measurements and controls within databases. In that context, most of the consistency measures describe ways to quantify degree of similarity between different instances of the same data set.

However, behind this initial focus was a set of assumptions about the ways that data is integrated in databases. (Before you can assume that instances of the same data set should be similar, a lot of work has already been done to get data into a database.)

The challenges of data integration are first about data structure and then about content.

Unless you build an integrated system completely from scratch, you will need to account for the different choices that underlie the data models in the set of systems that serve as sources to an integrated database. Because these choices have direct effects on data content, I think it is important to understand structural questions first.

Sometimes people assume that the data in different systems should be the same, because similar concepts are represented. These folks then get tripped up by differences. I think it is generally safer to assume that data will be different in different systems. Then you can be grateful when there are similarities.

From the perspective of data content – if you expect two systems to represent similar content (for example, if you expect your transactional Customer system and the Customer tables in your data warehouse to contain the same number of customers for the same time frame) then you need to know how to accurately quantify the data content in both systems.
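
As a very small sketch of what that quantification can look like in practice, the following compares distinct customer counts for the same time frame; every system stand-in, date, and identifier here is a hypothetical placeholder.

```python
# A small sketch of quantifying the same content in two systems: distinct
# customer counts for the same time frame, compared for reconciliation.
# The fetch functions, dates, and IDs are hypothetical placeholders for
# queries against the transactional system and the warehouse.

def count_distinct_customers(fetch_ids, start_date, end_date):
    """fetch_ids is any callable returning customer IDs for the window."""
    return len(set(fetch_ids(start_date, end_date)))

def source_ids(start, end):        # stand-in for a query against the transactional Customer system
    return ["C1", "C2", "C3", "C3"]

def warehouse_ids(start, end):     # stand-in for a query against the warehouse Customer tables
    return ["C1", "C2"]

src = count_distinct_customers(source_ids, "2023-01-01", "2023-03-31")
dwh = count_distinct_customers(warehouse_ids, "2023-01-01", "2023-03-31")
print(f"source={src}, warehouse={dwh}, gap={src - dwh}")   # a gap > 0 means records to trace
```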

Tracing the information chain and understanding where data comes from is essential to this process.

Richard Wang and others from MIT have written about the concept of the Information Product. In those articles they describe the effort to trace the information itself, rather than the system.

I think this is a useful approach because it cautions that the information flow itself must be documented.

It also may help people avoid the trap of creating a system architecture document when what they really need is a flow chart. However, it is also important to document the impact of any system on critical information.

The first set of tips I have for this process are very similar to those I outlined in the first question: begin with the end in mind, ensure you have clearly defined scope and deliverables, document what you do and do not know.

Specific to the activity of creating information flows, I think the following things are important:

  • Don’t expect people to already have this kind of information in the form you need. In most cases it will not be available.

  • Plan on producing flows at different levels of detail, from a chart that shows the general movement of data from system to system to charts that might depict changes to specific data elements.

  • Be as consistent as possible in presenting the information so that you can use the charts as tools to do the real work of the assessment, which is very likely to be enabling cross-system comparisons.

  • Do not expect to get it completely right the first time – or even the second or third. This kind of documentation requires multiple iterations.

  • Have a plan for storing the results, sharing them, and building on them. This kind of knowledge is extremely valuable, but also work-intensive, so you want to lay the foundation only once.

Dylan Jones: Sometimes you find that stakeholders just want a general, overall score for the quality of data. Is this feasible, or even beneficial, in your opinion?

Laura Sebastian-Coleman: My initial gut reaction is: No. This kind of measurement is neither feasible nor beneficial.

It makes me think of this situation: if you say the average temperature on earth is 60 degrees Fahrenheit, you have a single measurement. But it’s not a very helpful or convincing one if you are in Oslo in February or Cape Town in August. (I have a similar reaction when people say they want a dashboard not only before they have an engine, but also before they have defined what work their engine is supposed to accomplish.)

But then it occurs to me: if you are trying to answer questions related to the earth as a whole – say, the problem of global warming – then a series of average temperatures over time is exactly the kind of measurement that might convince people that we have a problem.

So I calm down and think: the goal of measurement is to convey information to people in a manner that allows them to understand it consistently across time.

People in leadership positions like scores. Scores are easy to remember and they provide a way to identify whether progress has been made and whether the rate of progress is acceptable.

There are practical implications to not providing data quality measurements to leadership. And it is important to provide these measurements in a way that helps leadership support data quality work, so whatever score is produced would need to be meaningful.

But the idea of creating one number that represents the overall quality of data points to another problem: there are practical implications of trying to create one score and have it be meaningful. It is a lot of work. Organizations that have some level of data quality measurement often have hundreds of individual measurements. It is not easy to roll these up into one number in a meaningful way.

In order to show some of the complexity, I’ll give the simplest example, validity.

In one of our systems, we have an automated process that measures the validity of 2500 columns each quarter.

These are spread across eight data domains. Because of how we collect the data – for each column we collect a distribution of values and assign a validity indicator to each value – we can report validity as one overall score: the percentage of columns with at least one invalid value.

Producing this number takes one query. However, disentangling something like the number of records with at least one invalid value would take significantly more work, because we look at numerous attributes on any given table and would need to account for how many times each record is counted, and so forth.
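
For illustration, a sketch of that roll-up follows; the result rows mimic the per-column validity output described above, with hypothetical domains, columns, and values.

```python
# A sketch of rolling column-level validity results up into one score.
# One row per column/value with a validity indicator, as described above;
# the domains, columns, and values are hypothetical.

validity_results = [
    {"domain": "Member",   "column": "GENDER_CD",    "value": "F",  "is_valid": True},
    {"domain": "Member",   "column": "GENDER_CD",    "value": "Q",  "is_valid": False},
    {"domain": "Claim",    "column": "STATUS_CD",    "value": "P",  "is_valid": True},
    {"domain": "Provider", "column": "SPECIALTY_CD", "value": "99", "is_valid": False},
]

columns = {(r["domain"], r["column"]) for r in validity_results}
columns_with_invalid = {(r["domain"], r["column"])
                        for r in validity_results if not r["is_valid"]}

score = 100.0 * len(columns_with_invalid) / len(columns)
print(f"{score:.1f}% of columns have at least one invalid value")   # 66.7%
```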

So what is the most useful ‘score’?

The one that is as simple as possible while still being meaningful.

The other factors to take into consideration when reporting on data quality are connected with how to present the information.

In our validity report, we can provide one number (the percentage of columns with at least one invalid value), but we also present a graph at the subject area level showing the relation between valid and invalid over time.

The graph is based on the same data as the score. But it conveys additional valuable detail (such as which data domains have the largest concentration of columns with invalid values) compressed in a visual form.
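
A sketch of that kind of view, using hypothetical domains and figures (and assuming matplotlib is available), might look like this:

```python
# A sketch of the subject-area view described above: the share of columns
# with invalid values, per data domain, tracked over time.
# Requires matplotlib; the domains and figures are hypothetical.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
invalid_pct_by_domain = {
    "Member":   [12, 10, 9, 7],
    "Claim":    [25, 24, 22, 21],
    "Provider": [5, 5, 4, 4],
}

for domain, series in invalid_pct_by_domain.items():
    plt.plot(quarters, series, marker="o", label=domain)

plt.ylabel("% of columns with at least one invalid value")
plt.title("Validity by data domain over time")
plt.legend()
plt.show()
```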

Anyone creating reports on data quality should familiarize themselves with best practices for visualization; for example, by reviewing Edward Tufte’s classic work, The Visual Display of Quantitative Information, or the work of Stephen Few and others.

Long and short: I think the only way to meet the need for a score is to balance the competing demands for simplicity and meaning.

Leadership may say they want one number, but they are very likely to be receptive to a set of scores, say at the level of a dimension of quality or a data domain, as long as they understand the scores well enough so they can use them with confidence.

Dylan Jones: As ever, thank you for sharing such detailed insights, Laura. I’ve really found your book to be incredibly useful and it’s now a vital part of my ‘data quality toolkit’.

Laura Sebastian-Coleman: Not a problem, Dylan. I hope your readers find the information useful, and thanks for the kind words about the book.