Data quality checks

As the name suggests, high-frequency checks (HFCs) are checks of incoming data conducted on a regular basis (ideally daily). They can be run on survey data or administrative data. Regardless of the source, they should be run on as much of the data as possible.

For survey data, HFCs are used for identifying and correcting data errors, monitoring survey progress, measuring enumerator performance, and detecting data fraud. HFCs play a similar role for administrative data but can also be used to check its coherence (the degree to which the administrative data are comparable to other data sources) and its accuracy (e.g., information on any known sources of errors in the administrative data) (Iwig et al. 2013).

Types of HFCs

HFCs fall into five broad categories:

  1. To detect errors: Identify if there are issues with the survey coding or problems with specific questions.
    1. Survey coding: Suppose question 1a asks “Do you have children under the age of 18?” followed by question 1b: “(if yes) Are they in school?” If respondents who answer “no” to the first question are shown the second question, then the skip pattern is not working correctly and should be fixed.
    2. Missing data: Are some questions skipped more than others? Are there questions that no respondents answered? This may indicate a programming error.
    3. Categorical variables: Are respondents selecting the given categories, or are many respondents selecting “None of the above” or “Other”? If conducting a survey, you may want to add categories or modify your existing ones.
    4. Too many similar responses: Is there a question where all respondents answer in the same way?
    5. Outliers: Are some respondents reporting values drastically higher or lower than the average response? Do these variables need to be top or bottom coded? Many outlier checks can be programmed directly into the survey, either to flag or to bar responses that are outside the acceptable range.
    6. Respondent IDs: Are there duplicates of your unique identifiers? If so, does the reason why make sense? (e.g., one circumstance in which there may be duplicates of unique IDs is when surveyors have to end and restart an interview.) Are there blank or invalid IDs? This might be a sign that your surveyors are not interviewing the correct respondents. A sketch of several of these error checks appears after this list.
  2. To monitor survey progress and track respondents: Checking these variables allows research teams to forecast how long it will take to complete a round of surveys while also identifying surveyors who are performing poorly. 
    1. How long do surveyors take to do one survey?
    2. How many surveys do surveyors complete in a day?
    3. Are the surveys being completed in one sitting or do respondents take breaks or stop the survey early?
    4. Are the correct respondents tracked and surveyed? Can you match respondents between rounds of data collection and across sources of data?
    5. Variables that measure survey progress might not be present in the data per se, but they can be constructed: you can collapse the dataset by enumerator to get this information (a sketch of this appears after this list). SurveyCTO automatically generates some variables that can be used here, such as SubmissionDate, startdate and enddate.
  3. To monitor surveyor performance: Identify if there are differences in responses that correspond to surveyors.
    1. Distribution checks: Is one of your surveyors reporting households with drastically higher incomes than the others? Look at the distribution of missing values, “I don’t know/refuse to answer” responses, and “No” responses to skip-order questions to detect whether surveyors are fraudulently shortening the survey to make their job easier.
    2. Number of outliers: Similar to the check for outliers when looking for data errors, but now check the number of outliers each enumerator has. A high number of outliers might mean the enumerator needs to be re-trained, or might indicate that the enumerator is fabricating data.
    3. Number of inconsistent responses: Check if some surveyors have high numbers of impossible responses (e.g., they report the head of a household is 30 but has a 28-year-old child, or they report the respondent has a college degree but is illiterate). This is also a sign the enumerator might need more training or is fabricating data.
    4. Productivity: Examine the count of surveys completed, communities covered, refusal (respondent refuses to be interviewed), and tracking rates (percent of targeted respondents reached) by enumerator.
  4. To detect data fraud:
    1. Duration of survey: Extremely short surveys might be an indication that the surveyor fabricated the data. 
    2. Location audits using GPS: Depending on your devices, you might be able to record the GPS location of the interviews, which will allow you to see if the surveyor is where they are supposed to be, or if they are staying in one place and completing multiple surveys, which might be a sign of fraud. Note that collecting GPS data requires IRB approval.
    3. Audio audits: Some survey platforms, like SurveyCTO, allow research teams to collect audio recordings. These recordings can either be listened to closely to check whether the enumerator asked the questions correctly, or can be analyzed to determine if there were multiple speakers or any speech at all. Note that recording audio requires IRB approval. These checks might detect surveyors who are cutting corners by filling out the survey themselves and making up data.
    4. Suspiciously high number of “no” responses for skip orders: Questions that only trigger additional questions if a respondent answers “yes” might be fraudulently reported as “no” so that the surveyor has to do less work. This can be detected by comparing the rates of “no” responses across surveyors (see the fraud-check sketch after this list).
    5. Suspiciously short sections: With some surveying platforms, you can code “speed limits” on questions, which will either forbid an enumerator from moving past a question until a certain time has passed or will flag questions where the enumerator advanced too quickly. This requires some up-front piloting of questions in order to know what the average amount of time spent on each question is. 
    6. Other considerations: Other checks for fraud may depend on the study’s context. See Finn and Ranchhod (2017) for more examples of potential checks for fraud, including comparing anthropometric data across survey waves.
  5. Special considerations for administrative data:
    1. Research teams should work with data providers to determine which variables can be checked for coherence (e.g., the average household income in this data should be no more than 2% off of the average household income reported in some other data source) as well as for accuracy (e.g., no more than 5% of households should fail to report an income in a given month).
    2. Detecting errors in administrative data is similar to detecting errors in survey data. In addition to the basic checks mentioned above, you should also check variables for coherence and accuracy. Many administrative datasets are panel data, so you can also perform additional logic checks (e.g., do respondents’ ages increase over time?). A sketch of such checks appears after this list.
    3. Tracking respondents is a primary goal with administrative data: you want to follow respondents both over time and across datasets. Check whether unique respondent IDs ever change (for instance, when someone moves out of their parents’ house and forms a new household).
    4. As you are not collecting the data, you might not know who was interviewed by which enumerator. Ideally you will work with the data provider to get this information. If the data provider is unwilling to share it, you should share any problematic observations with the data provider so they can work with their enumerators to ensure data quality.
    5. Your ability to detect data fraud depends largely on the coherence rules you determine with the data provider. Finding a high-quality dataset with similar respondents or in a similar context will help you determine if the data you are provided looks real or fraudulent.
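
To make the error checks in category 1 concrete, the following is a minimal Stata sketch, not a complete HFC script. The dataset name (hfc_input.dta), variable names (id, income, q5_other), and the three-standard-deviation outlier rule are all placeholders to adapt to your own survey.

```stata
* Minimal sketch of basic error checks (duplicates, missing data, outliers).
* File and variable names are placeholders -- adapt them to your survey.
use "hfc_input.dta", clear

* Respondent IDs: flag duplicate or blank unique identifiers
duplicates tag id, gen(dup_id)
list id if dup_id > 0 | missing(id)

* Missing data: summarize item non-response across all questions
misstable summarize

* Categorical variables: how often is "Other" selected?
tabulate q5_other, missing

* Outliers: flag values more than 3 standard deviations from the mean
quietly summarize income
gen flag_income = abs(income - r(mean)) > 3 * r(sd) & !missing(income)
list id income if flag_income == 1
```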
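
The enumerator-level metrics in categories 2 and 3 can be constructed by collapsing the data by enumerator, as noted above. The sketch below continues from the previous one; enumerator_id and consent are assumed variable names, and it assumes the SurveyCTO start and end times were exported as Stata datetime (%tc) values, which are stored in milliseconds. Verify all of this against your own export before using it.

```stata
* Sketch of enumerator-level progress and performance metrics.
* Continues from the previous sketch; enumerator_id, consent, starttime,
* and endtime are assumed variable names.
gen duration_min = (endtime - starttime) / (60 * 1000)   // %tc datetimes are in milliseconds
gen byte refused = (consent == 0)
gen byte one     = 1

preserve
collapse (sum) n_surveys=one n_outliers=flag_income            ///
         (mean) avg_duration=duration_min refusal_rate=refused, by(enumerator_id)
list, clean
restore
```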
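
The skip-order and duration checks in category 4 follow the same logic. In this sketch, q1a_children is an assumed yes/no variable corresponding to the question 1a example above, and the 15-minute cutoff is an arbitrary illustration; in practice the cutoff should come from piloting.

```stata
* Sketch of two fraud checks: "no" rates on a skip-trigger question, and
* implausibly short surveys. Variable names and the cutoff are illustrative.
* Share of "no" answers to a skip-trigger question, by surveyor
tabulate enumerator_id q1a_children, row nofreq

* Surveys completed implausibly fast (cutoff should come from piloting)
list enumerator_id id duration_min if duration_min < 15
```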
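
For administrative data, coherence rules and panel-logic checks can be coded the same way. In the sketch below, the benchmark income, the 2% tolerance from the example above, and the dataset and variable names (admin_panel.dta, id, wave, age, hh_income) are all placeholders to agree on with the data provider.

```stata
* Sketch of coherence and panel-logic checks for administrative data.
* Benchmark value, tolerance, and variable names are placeholders.
use "admin_panel.dta", clear

* Coherence: mean income should be within ~2% of an external benchmark
local benchmark 52000                     // mean income in the comparison source (placeholder)
quietly summarize hh_income
display "Deviation from benchmark: " 100 * abs(r(mean) - `benchmark') / `benchmark' " percent"

* Panel logic: respondents' ages should not decrease across waves
sort id wave
by id: gen age_decreased = (age < age[_n-1]) if _n > 1
list id wave age if age_decreased == 1
```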

Implementing HFCs

There are three main ways to implement HFCs:

  1. Custom do-files: This entails developing a do-file or R script that checks for the above data quality issues. For examples, see the example custom HFC Stata and R code, and an HFC template (please note: the code in these templates is a work in progress, and we strongly recommend thoroughly testing it before using it on a project. If you have comments for improvements or modifications to the code, please submit them here). Customized do-files have the advantage of being flexible and are especially useful when standardized tools will not suit your needs, but they require time upfront to develop. Not every potential data quality issue is foreseeable, so custom do-files might need periodic updating. A minimal skeleton of a master HFC do-file follows this list.
  2. IPA user-written commands: Innovations for Poverty Action (IPA) developed commands to conduct HFCs. These also require an upfront investment in order to understand what each command does and how to use them. 
  3. SurveyCTO built-in features: These can be used to automate many data quality checks.
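
As one illustration of the first option, a custom HFC system is often organized as a master do-file that loads the latest data and then runs a series of check do-files. The paths and file names below are purely illustrative; the linked templates above are more complete starting points.

```stata
* Illustrative skeleton of a master HFC do-file; paths and file names are placeholders.
clear all
global hfc_dir "C:/project/hfc"

use "$hfc_dir/raw/latest_submissions.dta", clear
do "$hfc_dir/checks/01_error_checks.do"        // duplicates, missing data, outliers
do "$hfc_dir/checks/02_enumerator_checks.do"   // progress and performance by enumerator
do "$hfc_dir/checks/03_fraud_checks.do"        // durations, skip patterns, GPS/audio audits
```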

Regardless of implementation method, it is best to prepare HFC procedures before enumerators go to the field.

On a daily basis, the Research Assistant should download the new data, run the HFC code on it, flag any issues, and send flagged responses to the PI/Research Manager. This is usually done by creating a spreadsheet with some basic information on the respondent (i.e., their unique ID, location, phone number, and the problematic response) so that field staff can contact them to verify their response. Once field teams have verified the data, a do-file can be used to fix or reconcile any errors (important: never directly edit or overwrite the raw data! Always make edits in a do-file). This do-file can be updated regularly to incorporate new edits as you conduct HFCs on incoming batches of data. A sketch of this flag-and-fix workflow appears below.
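
Below is a minimal sketch of that daily flag-and-fix workflow, assuming the flag variables created in the earlier sketches and hypothetical file and variable names (flagged_responses.xlsx, location, phone, corrections.do). The key point it illustrates is that the raw data file is never overwritten; every verified fix lives in a do-file and is saved to a separate cleaned dataset.

```stata
* (1) After running the checks, export flagged responses for field verification.
preserve
keep if flag_income == 1 | dup_id > 0
keep id location phone income
export excel using "flagged_responses.xlsx", firstrow(variables) replace
restore

* (2) corrections.do: apply verified fixes without touching the raw file.
use "raw/latest_submissions.dta", clear
replace income = 12000 if id == "A1234"    // example: value confirmed with the respondent
save "clean/latest_submissions_corrected.dta", replace
```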

On an ongoing (i.e., weekly or monthly) basis, the RA should maintain the HFC code (e.g., by making necessary adjustments). Changes to the HFC code should be made if you modify the survey (e.g., adding a response that was commonly given as an “Other - please specify” to the set of options). As more data is collected, you may be able to perform additional tests, such as comparing surveyors in one district to surveyors in another, or comparing responses given to the same surveyor across districts. You may want to modify the code to include these tests as time goes on. Discuss with your PIs how often modifications should be made to the HFC code.

There are further considerations to take into account when conducting HFCs on remote survey data, including figuring out optimal call times and tracking the number of call attempts. See more in the “Best practices for WFH-CATI data quality” section below.