A Review of Data Quality Assessment Methods for Public Health Information Systems
Abstract
High quality data and effective data quality assessment are required for accurately evaluating the impact of public health interventions and measuring public health outcomes. Data, data use, and data collection process, as the three dimensions of data quality, all need to be assessed for overall data quality assessment. We reviewed current data quality assessment methods. Relevant studies were identified in major databases and on well-known institutional websites. We found the dimension of data was most frequently assessed. Completeness, accuracy, and timeliness were the three most-used attributes among a total of 49 attributes of data quality. The major quantitative assessment methods were descriptive surveys and data audits, whereas the common qualitative assessment methods were interviews and documentation review. The limitations of the reviewed studies included inattentiveness to data use and data collection process, inconsistency in the definition of attributes of data quality, failure to address data users’ concerns and a lack of systematic procedures in data quality assessment. This review study is limited by the coverage of the databases and the breadth of public health information systems. Further research could develop consistent data quality definitions and attributes. More research effort should be given to assessing the quality of data use and the quality of data collection process.
Keywords:
data quality, information quality, data use, data collection process, evaluation, assessment, public health, population health, information systems
1. Introduction
Public health is “the science and art of preventing disease, prolonging life, and promoting physical health and efficiency through organized community efforts” [1]. The ultimate goal of public health is to improve health at the population level, and this is achieved through the collective mechanisms and actions of public health authorities within the government context [1,2]. Three functions of public health agencies have been defined: assessment of health status and health needs, policy development to serve the public interest, and assurance that necessary services are provided [2,3]. Since data, information and knowledge underpin these three functions, public health is inherently a data-intensive domain [3,4]. High quality data are the prerequisite for better information, better decision-making and better population health [5].
Public health data represent and reflect the health and wellbeing of the population, the determinants of health, public health interventions and system resources [6]. The data on health and wellbeing comprise measures of mortality, ill health, and disability. The levels and distribution of the determinants of health are measured in terms of biomedical, behavioral, socioeconomic and environmental risk factors. Data on public health interventions include prevention and health promotion activities, while those on system resources encompass material, funding, workforce, and other information [6].
Public health data are used to monitor trends in the health and wellbeing of the community and of health determinants. Also, they are used to assess the risks of adverse health effects associated with certain determinants, and the positive effects associated with protective factors. The data inform the development of public health policy and the establishment of priorities for investment in interventions aimed at modifying health determinants. They are also used to monitor and evaluate the implementation, cost and outcomes of public health interventions, and to implement surveillance of emerging health issues [6].
Thus, public health data can help public health agencies to make appropriate decisions, take effective and efficient action, and evaluate the outcomes [7,8]. For example, health indicators set up the goals for the relevant government-funded public health agencies [5]. Well-known health indicators are the Millennium Development Goals (MDGs) 2015 for the United Nations member states [9]; the European Core Health Indicators for member countries of the European Union [10]; “Healthy People” in the United States, which set up 10-year national objectives for improving the health of US citizens [11]; “Australia: The Healthiest Country by 2020” that battles lifestyle risk factors for chronic disease [12]; and “Healthy China 2020”, an important health strategy to improve the public’s health in China [13].
Public health data are generated from public health practice, with data sources being population-based and institution-based [5,6]. Population-based data are collected through censuses, civil registrations, and population surveys. Institution-based data are obtained from individual health records and administrative records of health institutions [5]. The data stored in public health information systems (PHIS) must first undergo collection, storage, processing, and compilation. The procured data can then be retrieved, analyzed, and disseminated. Finally, the data will be used for decision-making to guide public health practice [5]. Therefore, the data flows in a public health practice lifecycle consist of three phases: data, data collection process and use of data.
PHIS, whether paper-based or electronic, are the repositories of public health data. The systematic application of information and communication technologies (ICTs) to public health has seen the proliferation of computerized PHIS around the world [14,15,16]. These distributed systems collect coordinated, timely, and useful multi-source data, such as those collected by nation-wide PHIS from health and other sectors [17]. These systems are usually population-based, and recognized by government-owned public health agencies [18].
The computerized PHIS are developed with broad objectives, such as to provide alerts and early warning, support public health management, stimulate research, and to assist health status and trend analyses [19]. Significant advantages of PHIS are their capability of electronic data collection, as well as the transmission and interchange of data, to promote public health agencies’ timely access to information [15,20]. The automated mechanisms of numeric checks and alerts can improve validity and reliability of the data collected. These functions contribute to data management, thereby leading to the improvement in data quality [21,22].
Negative effects of poor data quality, however, have often been reported. For example, Australian researchers reported coding errors due to poor quality documentations in the clinical information systems. These errors had consequently led to inaccurate hospital performance measurement, inappropriate allocation of health funding, and failure in public health surveillance [23].
The establishment of information systems driven by the needs of single-disease programs may cause excessive data demand and fragmented PHIS, which undermine data quality [5,24]. Studies in China, the United Kingdom and Pakistan reported data users’ lack of trust in the quality of AIDS, cancer, and health management information systems due to unreliable or uncertain data [25,26,27].
Sound and reliable data quality assessment is thus vital to obtaining the high data quality that enhances users’ confidence in public health authorities and their performance [19,24]. As countries monitor and evaluate the performance and progress of established public health indicators, the need for data quality assessment in the PHIS that store the performance- and progress-related data has never been greater [24,28,29]. Data quality assessment, long recommended for ensuring the quality of data in PHIS, has now gained widespread acceptance in routine public health practice [19,24].
Data quality in public health has different definitions from different perspectives. These include: “fit for use in the context of data users” [30], (p. 2); “timely and reliable data essential for public health core functions at all levels of government” [31], (p. 114) and “accurate, reliable, valid, and trusted data in integrated public health informatics networks” [32]. Whether the specific data quality requirements are met is usually measured along a certain number of data quality dimensions. A dimension of data quality represents or reflects an aspect or construct of data quality [33].
Data quality is recognized as a multi-dimensional concept across public health and other sectors [30,33,34,35]. Following the “information chain” perspective, Karr et al. used “three hyper-dimensions” (i.e., process, data and user) to group a set of conceptual dimensions of data quality [35]. Accordingly, the methods for assessment of data quality must be useful to assess these three dimensions [35]. We adopted the approach of Karr et al. because their typology provided a comprehensive perspective for classifying data quality assessment. However, we replace “process” by “data collection process” and “user” by “data use”. “Process” is a broad term and may be considered as the whole process of data flows, including data and use of data. “User” is a specific term related to data users or consumers and may ignore the use of data. To accurately reflect the data flows in the context of public health, we define the three dimensions of data quality as data, data use and data collection process. The dimension of data focuses on data values or data schemas at record/table level or database level [35]. The dimension of data use, related to use and user, is the degree and manner in which data are used [35]. The dimension of data collection process refers to the generation, assembly, description and maintenance of data [35] before data are stored in PHIS.
Data quality assessment methods are generally based on measurement theory [35,36,37,38]. Each dimension of data quality consists of a set of attributes. Each attribute characterizes a specific data quality requirement, thereby offering a standard for data quality assessment [35]. Each attribute can be measured by different methods; there is therefore flexibility in the methods used to measure data quality [36,37,38]. As the three dimensions of data quality are embedded in the lifecycle of public health practice, we propose a conceptual framework for data quality assessment in PHIS ( ).
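To illustrate how an attribute is operationalized as a concrete measure, the following minimal sketch computes the three most commonly assessed attributes (completeness, accuracy, timeliness) over a handful of hypothetical case records. The record fields, the gold-standard dictionary and the 7-day timeliness threshold are illustrative assumptions, not drawn from any reviewed study.

```python
from datetime import date

# Hypothetical notifiable-disease case records; field names are illustrative only.
records = [
    {"id": 1, "diagnosis": "A09", "onset": date(2013, 3, 1), "reported": date(2013, 3, 3)},
    {"id": 2, "diagnosis": None,  "onset": date(2013, 3, 2), "reported": date(2013, 3, 20)},
    {"id": 3, "diagnosis": "B01", "onset": date(2013, 3, 5), "reported": date(2013, 3, 6)},
]

def completeness(records, field):
    """Proportion of records with a non-missing value for `field`."""
    return sum(r[field] is not None for r in records) / len(records)

def timeliness(records, max_days=7):
    """Proportion of records reported within `max_days` of symptom onset."""
    on_time = sum((r["reported"] - r["onset"]).days <= max_days for r in records)
    return on_time / len(records)

def accuracy(records, gold, field):
    """Proportion of records whose `field` agrees with a gold-standard source,
    e.g., re-abstracted medical charts."""
    return sum(r[field] == gold[r["id"]] for r in records) / len(records)

gold_diagnoses = {1: "A09", 2: "A09", 3: "B02"}  # assumed gold standard

print(completeness(records, "diagnosis"))              # 2/3
print(timeliness(records))                             # 2/3
print(accuracy(records, gold_diagnoses, "diagnosis"))  # 1/3
```

Each function returns a proportion in [0, 1], matching the percentage-style measures most often reported in the reviewed studies; real assessments would add sampling, confidence intervals and field verification on top of such point estimates.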
Although data quality has always been an important topic in public health, we have identified a lack of systematic review of data quality assessment methods for PHIS. This is the motivation for this study because knowledge about current developments in methods for data quality assessment is essential for research and practice in public health informatics. This study aims to investigate and compare the methods for data quality assessment of PHIS so as to identify possible patterns and trends emerging over the first decade of the 21st century. We take a qualitative systematic review approach using our proposed conceptual framework.
2. Methods
2.1. Literature Search
We identified publications by searching several electronic bibliographic databases: Scopus, IEEE Xplore, Web of Science, ScienceDirect, PubMed, Cochrane Library and ProQuest. Because many public health institutes also publish guidelines, frameworks, or instruments to guide the institutional approach to assessing data quality, some well-known institutions’ websites were also reviewed for relevant literature. The following words and MeSH headings were used individually or in combination: “data quality”, “information quality”, “public health”, “population health”, “information system *”, “assess *”, “evaluat *”. (“*” was used to find variations of word stems.) The search was confined to articles published in English or Chinese.
The first author performed the literature search between June 2012 and October 2013. The inclusion criteria were peer-refereed empirical studies or institutional reports of data quality assessment in public health or PHIS during the period 2001–2013. The exclusion criteria were narrative reviews, expert opinion, correspondence and commentaries in the topic area. To improve coverage, a manual search of the literature was conducted to identify papers referenced by other publications, papers and well-known authors, and papers from personal databases.
2.2. Selection of Publications
Citations identified in the literature search were screened by title and abstract for decisions about inclusion or exclusion in this review. If there was uncertainty about the relevance of a citation, the full-text was retrieved and checked. A total of 202 publications were identified and were manually screened. If there was uncertainty about whether to include a publication, its relevance was checked by the fourth author. Finally 39 publications that met the inclusion criteria were selected. The screening process is summarized in .
2.3. Data Abstraction
The selected publications were stored in an EndNote library. Data extracted from the publications included author, year of publication, aim of data quality assessment, country and context of the study, function and scope of the PHIS, definition of data quality, methods for data quality assessment, study design, data collection methods, data collected, research procedure, methods for data analysis, key findings, conclusions and limitations.
The 39 publications were placed in two groups according to whether they were published by a public health institution at national or international level or by individual researchers. If the article was published by the former, it is referred to as an institutional publication, if by the latter, as a research paper.
4. Discussion
Data are essential to public health. They represent and reflect public health practice. The broad application of data in PHIS for the evaluation of public health accountability and performance has raised public health agencies’ awareness of data quality, and of methods and approaches for its assessment. We systematically reviewed the current status of quality assessment for each of the three dimensions of data quality: data, data collection process and data use. The results suggest that measurement theory has been applied either explicitly or implicitly in the development of data quality assessment methods for PHIS. The majority of previous studies assessed data quality by a set of attributes using certain measures. Our findings, based on the proposed conceptual framework of data quality assessment for public health, also identified gaps in the methods included in this review.
The importance of systematic, scientific data quality assessment needs to be highlighted. All three dimensions of data quality, data, data use and data collection process, need to be systematically evaluated. To date, the three dimensions have not been given the same weight across the reviewed studies: the quality of data use and of data collection process has not received adequate attention. This lack of recognition might reflect a lack of consensus on the dimensions of data quality. Because these three dimensions contribute equally to data quality, they should be given equal weight in data quality assessment. Further development of methods to assess data collection process and data use is required.
Effort should also be directed towards clear conceptualization of the terms commonly used to describe and measure data quality, such as the dimensions and attributes of data quality. The lack of clear definitions of key terms creates confusion and uncertainty and undermines the validity and reliability of data quality assessment methods. An ontology-based exploration and evaluation from the perspective of data users would be useful for future development in this field [33,75]. Two steps, conceptualization of data quality attributes and operationalization of the corresponding measures, need to be taken into careful consideration and followed systematically, as shown in our proposed conceptual framework.
Data quality assessment should use mixed methods (e.g., qualitative and quantitative assessment methods) to assess data from multiple sources (e.g., records, organisational documentation, data collection process and data users) and used at different levels of the organisation [33,35,36,38,75,76]. More precisely, we strongly suggest that subjective assessment of end-users’ or customers’ perspectives be an indispensable component of data quality assessment for PHIS. The importance of this strategy has long been articulated by researchers [33,75,76]. Objective assessment methods assess the data already collected and stored in the PHIS; many such methods have been developed, widely accepted and used in practice [38,76]. Subjective assessments, on the other hand, complement objective data quality assessment. For example, interviews are useful for identifying the root causes of poor data quality and for designing effective strategies to improve it. Meanwhile, field observation and validation are necessary wherever possible, because reference of data to the real world gives data users confidence in the data quality and in the application of data to public health decision-making, action, and outcomes [52]. The validity of a study would be doubtful if the quality of its data could not be verified in the field [36], especially when the data come from a PHIS consisting of secondary data.
To increase the rigor of data quality assessment, the relevant statistical principles for sample size calculation, research design, measurement and analysis need to be adhered to. Use of convenience or specifically chosen sampling methods in 24 studies included in this review reduced the representativeness and generalizability of the findings of these studies. At the same time, reporting of data quality assessment needs to present the detailed procedures and methods used for the study, the findings and limitations. The relatively simple data analysis methods using only descriptive statistics could lead to loss of useful supportive information.
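One of the statistical principles mentioned above, sample size calculation, can be sketched with the standard formula for estimating a proportion, n = z²p(1−p)/e². The function below is a generic illustration (not taken from any reviewed study); the default p = 0.5 is the conservative worst-case assumption when the true rate, e.g., a completeness rate, is unknown.

```python
import math

def sample_size_for_proportion(p=0.5, margin=0.05, z=1.96):
    """Number of records needed to estimate a proportion (e.g., a completeness
    or accuracy rate) within +/- `margin`, at the confidence level implied by
    `z` (1.96 corresponds to ~95%). Uses n = z^2 * p * (1 - p) / margin^2,
    rounded up to a whole record."""
    return math.ceil(z**2 * p * (1 - p) / margin**2)

print(sample_size_for_proportion())             # +/-5% margin at 95% confidence -> 385
print(sample_size_for_proportion(margin=0.03))  # tighter +/-3% margin -> 1068
```

For audits of a small fixed register, a finite population correction would shrink these figures; for cluster or multistage designs like several reviewed studies used, a design effect would inflate them.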
Finally, to address the gaps identified in this review, we suggest re-prioritizing the orientation of data quality assessment in future studies. Data quality is influenced by technical, organizational, behavioural and environmental factors [35,41]. It spans large information-system contexts, specific knowledge and multi-disciplinary techniques [33,35,75]. In the reviewed studies, data quality was frequently assessed merely as a component of the quality, effectiveness or performance of the PHIS. This may reflect that the major concern of public health lies in managerial efficiency, especially of the PHIS institutions. It may also reflect differences in the resources available to, and the responsibilities of, institutions and individual researchers. However, data quality assessment hidden within other scopes may lead to neglect of data management and thereby leave data quality problems unnoticed and enduring in public health practice. Data quality needs to be positioned at the forefront of public health as a distinct area deserving specific scientific research and management investment.
While this review provides a detailed overview of data quality assessment issues, its coverage has some limitations, constrained by access to the databases and by the breadth of public health information systems, which made systematic comparison among studies challenging. The search was also limited by the lack of MeSH subject headings for data quality of PHIS, which could have caused our search to miss some relevant publications. To compensate, we searched well-known institutional publications and manually searched the references of each article retrieved.
Our classification process was primarily subjective. It is possible that some original researchers disagree with our interpretations. Each assessment method has contributions and limitations which make the choices difficult. We provided some examples of approaches to these issues.
In addition, our evaluation is limited by an incomplete presentation of details in some of the papers that we reviewed. A comprehensive data quality assessment method includes a set of guidelines and techniques that defines a rational process to assess data quality [37]. The detailed procedure of data analysis, data quality requirements analysis, and identification of critical attributes is rarely given in the reviewed papers. A lack of adequate detail in the original studies could have affected the validity of some of our conclusions.
5. Conclusions
Public health is a data-intensive field which needs high-quality data to support public health assessment and decision-making and to assure the health of communities. Data quality assessment is important for public health. In this review of the literature we have examined data quality assessment methods based on our proposed conceptual framework. This framework incorporates the three dimensions of data quality into the assessment of overall data quality: data, data use and data collection process. We found that the dimension of the data themselves was most frequently assessed in previous studies. Most methods for data quality assessment evaluated a set of attributes using relevant measures. Completeness, accuracy, and timeliness were the three most-assessed attributes. Quantitative data quality assessment primarily used descriptive surveys and data audits, while qualitative data quality assessment primarily used interviews, documentation review and field observation.
We found that data use and data collection process have not been given adequate attention, although they are equally important factors determining the quality of data. Other limitations of the previous studies were inconsistency in the definition of the attributes of data quality, failure to address data users’ concerns and a lack of triangulation of mixed methods for data quality assessment. The reliability and validity of the data quality assessments were rarely reported. These gaps suggest that future data quality assessment for public health needs to consider the three dimensions of data quality, data, data use and data collection process, equally. More work is needed to develop clear and consistent definitions of data quality and systematic methods and approaches for data quality assessment.
The results of this review highlight the need for the development of data quality assessment methods. As suggested by our proposed conceptual framework, future data quality assessment needs to equally pay attention to the three dimensions of data quality. Measuring the perceptions of end users or consumers towards data quality will enrich our understanding of data quality issues. Clear conceptualization, scientific and systematic operationalization of assessment will ensure the reliability and validity of the measurement of data quality. New theories on data quality assessment for PHIS may also be developed.
Acknowledgments
The authors wish to gratefully acknowledge the help of Madeleine Strong Cincotta in the final language editing of this paper.
Table A1
Ancker et al. 2011 [59]
Attributes and major measures: Percentage of missing data, inconsistencies and potential errors of different variables; number of duplicate records, of non-standardized vocabulary entries and of inappropriate fields.
Study design: Quantitative audit of data attributes of a dataset.
Data collection methods: Selected one data set and used tools to query 30 variables; manually assessed data formats.
Data analysis methods: Rates, percentages or counts.
Contribution: Identified data quality issues and their root causes.
Limitations: Needs a specific data query tool.

Bosch-Capblanch et al. 2009 [58]
Attributes and major measures: Accuracy. Proportions in the relevant data set, such as the recounted number of an indicator’s data divided by the reported number at the next tier in the reporting system. A ratio below 100% indicates “over-reporting”; a ratio above 100% suggests “under-reporting”.
Study design: Quantitative audit of data accuracy by external auditors applying the WHO DQA in 41 countries.
Data collection methods: A multistage weighted representative random sampling procedure, with field visits verifying the reported data. Compared data collected in the field with the reports at the next tier.
Data analysis methods: Percentage, median, inter-quartile range, 95% confidence intervals, ratio (verification factor quotient) adjusted and extrapolated.
Contribution: Systematic methodology to describe data quality and to identify basic recording and reporting practices as key factors and good practices.
Limitations: Limited attributes, lack of verification of the source of actual data, and exclusion of non-eligible districts.

CDC 2001 [15]
Attributes and major measures: Completeness, accuracy. Percentage of blank or unknown responses; ratio of recorded data values over true values.
Study design: Quantitative audit of a dataset, a review of sampled data, a special record linkage, or a patient interview.
Data collection methods: Calculating the percentage of blank or unknown responses to items on recording forms, reviewing sampled data, conducting record linkage, or a patient interview.
Data analysis methods: Descriptive statistics: percentage.
Contribution: Provides generic guidelines.
Limitations: Lack of detail on procedures; needs adjustment.

Chiba et al. 2012 [57]
Attributes and major measures: Completeness: percentage of complete data. Accuracy: 1 minus the percentage of the complete data that were illegible, wrongly coded, inappropriate or unrecognized. Relevance: comparing the data categories with those in the upper-level report to evaluate whether the data collected satisfied management information needs.
Study design: Quantitative verification of data accuracy and completeness, and qualitative verification of data relevance, in a retrospective comparative case study.
Data collection methods: Purposive sampling; clinical visits; re-entered and audited 30 data categories of one year of data to evaluate accuracy and completeness; qualitatively examined data categories and instructions to assess the relevance, completeness and accuracy of the data; semi-structured interviews to capture factors influencing data quality.
Data analysis methods: Descriptive statistics for accuracy and completeness of the data. Qualitative data were thematically grouped and analyzed by data categories, instructions, and key informants’ views.
Contribution: Quantitative and qualitative verification of data quality; comparison of two hospitals increased the generalizability of the findings.
Limitations: Consistency and timeliness were not assessed. Data from the system could not be validated.

CIHI 2009 [30]
Attributes and major measures: Accuracy: coverage, capture and collection, unit non-response, item (partial) non-response, measurement error, edit and imputation, processing and estimation. Timeliness: data currency at the time of release, documentation currency. Comparability: data dictionary standards, standardization, linkage, equivalency, historical comparability. Usability: accessibility, documentation, interpretability. Relevance: adaptability, value.
Study design: Quantitative method, user survey questionnaire.
Data collection methods: Questionnaire asking users for ratings of each construct: met, not met, unknown or not applicable (or minimal or none, moderate, significant, or unknown). All levels of the system were taken into account in the assessment.
Data analysis methods: Descriptive statistics for ratings by each criterion; the overall assessment for a criterion based on the worst assessment of the applicable levels.
Contribution: Data quality assessed from the user’s perspective; provides comprehensive characteristics and criteria for each dimension of data quality (5 dimensions, 19 characteristics and 61 criteria).
Limitations: Undefined survey procedures, including sample size. Being an internal assessment, rating scores were used for internal purposes.

Clayton et al. 2013 [56]
Attributes and major measures: Accuracy. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV).
Study design: Quantitative method to audit a dataset, with a power calculation of 840 medical records.
Data collection methods: Two-stage sampling of study sites; abstracting records and auditing 25 data variables to assess the accuracy of the data reported in three data sources.
Data analysis methods: Descriptive statistics calculated for each data source; summary measure of kappa values using the paired-sample Wilcoxon signed rank test.
Contribution: Accessed and linked three data sources (maternal medical charts, birth certificates and hospital discharge data) whose access is limited, using the medical chart as the gold standard.
Limitations: Limited generalizability of the findings; low sample size and limited representativeness.

Corriols et al. 2008 [55]
Attributes and major measures: Under-reporting. Calculating the difference between registered cases and surveyed cases.
Study design: Quantitative method administering a nationwide cross-sectional survey.
Data collection methods: Four-stage consistent random sampling method across the country; face-to-face interview questionnaire survey.
Data analysis methods: Descriptive statistics for estimation of national under-reporting using the survey results.
Contribution: Good representativeness of the study population.
Limitations: Lack of case diagnosis information and of the quality of the source of the data.

Dai et al. 2011 [69]
Attributes and major measures: Under-reporting, errors on report forms, errors resulting from data entry; completeness of information, accuracy, timeliness.
Study design: Qualitative and quantitative methods, reviewing publications on the system and data from the system.
Data collection methods: Reviewing publications on the system and data from the system.
Data analysis methods: Descriptive statistics for quantitative data; thematic grouping for qualitative data.
Contribution: Evaluated all existing sub-systems included in the system.
Limitations: Undefined review procedures; lack of verification of source data.

Dixon et al. 2011 [54]
Attributes and major measures: Completeness. The proportion of diagnosed cases and the proportion of fields in a case report.
Study design: Quantitative method auditing a dataset.
Data collection methods: Created a minimum data set of 18 key data elements; used structured query language (SQL) statements to calculate the percent completeness of each field across a total of 7.5 million laboratory reports.
Data analysis methods: Descriptive statistics to calculate the difference between the completeness scores across samples.
Contribution: Development of a method for evaluating the completeness of laboratory data.
Limitations: Needs a specific data query tool; only assessed completeness.

Edmond et al. 2011 [68]
Attributes and major measures: Completeness, illegible handwriting, calculation errors. The proportion of the consultation rates for two items, the proportion of illegible handwriting requiring clarification, and the proportion of calculation errors on the submitted record forms.
Study design: Quantitative method auditing the submitted record forms in the dataset.
Data collection methods: 3303 cards from five randomly selected weeks from each year between 2003 and 2009.
Data analysis methods: Descriptive statistics for the percentage of each data quality attribute.
Contribution: Random selection of the dataset.
Limitations: Only calculated completeness, without field verification of the accuracy of the data.

Ford et al. 2007 [53]
Attributes and major measures: Accuracy. Sensitivity, specificity and positive predictive values.
Study design: Quantitative method using record linkage to audit a dataset, comparing the system with a gold standard (a statewide audit dataset).
Data collection methods: Calculated data quality indicators for 18 data variables, compared with a statewide audit (gold standard) including 2432 babies admitted to NICUs, 1994–1996.
Data analysis methods: Descriptive statistics with exact binomial confidence intervals for data quality attributes; comparison of the two datasets using the chi-square test.
Contribution: The findings are consistent with other validation studies comparing routinely collected population health data with medical records.
Limitations: Lack of verification of variations between the two datasets; inadequate representativeness.

Forster et al. 2008 [67]
Attributes and major measures: Missing data. The percentage of missing data.
Study design: Quantitative method auditing a dataset.
Data collection methods: Assessed data quality of a set of six key variables. A global missing-data index was computed as the median of the percentages of missing data; sites were ranked according to this index.
Data analysis methods: Confidence intervals (CI), Cronbach’s alpha, multivariate logistic models, Spearman rank correlation coefficient.
Contribution: Directly examined associations between site characteristics and data quality.
Limitations: Convenience sample and uncertain generalizability.

Freestone et al. 2012 [52]
Attributes and major measures: Accuracy, consistency, granularity.
Study design: Quantitative method auditing a dataset from three components: source documents, data extraction/transposition, and data cleaning.
Data collection methods: Systematic sampling of 200 cases, each geocoded and comparatively assessed for data quality with and without the influence of geocoding, by pre-selected criteria.
Data analysis methods: Data quality measured by category: perfect, near perfect, poor. Paired t-test for the 200 samples and chi-square test by year.
Contribution: Quantified data quality attributes under different factors.
Limitations: No reference type and no field verification (for historic data).

Frizzelle et al. 2009 [51]
Attributes and major measures: Accuracy, completeness, currency. Assessed by positional errors, generalizations incompatible with highly accurate geospatial locations, and whether datasets were updated with change.
Study design: Quantitative method using geographic information systems (GIS), developing a custom road dataset for analyzing the data quality of four datasets.
Data collection methods: Developed a custom road dataset and compared it with four readily available public and commercial road datasets; developed three analytical measures to assess comparative data quality.
Data analysis methods: Percentages, concordance coefficients and Pearson correlation coefficients.
Contribution: Exemplary assessment of the feasibility of readily available commercial or public road datasets; outlines the steps of developing a custom dataset.
Limitations: No field verification for historic data.

Hahn et al. 2013 [50]
Attributes and major measures: Completeness, accuracy. The percentage of correctly or completely transmitted items from the original data source to secondary data sources.
Study design: A multiple case study using quantitative and qualitative approaches in three antenatal care clinics of two private and one public Kenyan hospital.
Data collection methods: Quantitative method: selected 11 data tracer items, followed retrospectively and audited against an independently created gold standard. Qualitative methods: structured interviews and qualitative in-depth interviews to assess the subjective dimensions of data quality, with five-point scales for each statement; purposeful sampling of 44 staff for the survey and 15 staff for key-informant interviews.
Data analysis methods: Quantitative data: manual review, descriptive statistics, Kruskal-Wallis test, Mann-Whitney U test for continuous measures. Qualitative data: processed manually, classified and grouped by facility and staff class.
Contribution: Combined different methods and viewed the information systems from different viewpoints, covering the quality of the PHIS and drawing suggestions for the improvement of data quality from the qualitative results; likely to produce robust results in other settings.

Harper et al. 2011 [66]
Attributes and major measures: Completeness: the proportion of filled fields on the reports. Validity: the proportion of written indicators matching the assigned standard; the proportion of incorrectly entered numbers; the proportion of illegible entries; the proportion of entries out of chronological order.
Study design: Quantitative method auditing an electronic database of entries for a reference syndrome, manually extracted from an anonymized dataset of E-Book health registry entries.
Data collection methods: A random systematic sample of 10% of the extracted entries (i.e., beginning with a randomly chosen starting point and then performing interval sampling to check 10% of records), with an acceptable error rate of <5%.
Data analysis methods: Descriptive statistics on the attributes. To avoid bias, age and sex proportions were extracted from available records and compared to National Census data.
Contribution: Examined data quality using a reference syndrome, making it possible to provide informed recommendations. Descriptive data analysis provides grounded and useful information for decision makers.
Limitations: No evaluation of data collection methods.

Hills et al. 2012 [73]
Attributes and major measures: Timeliness: the number of days between Service Date and Entry Date of submission of data to the system (three categories: ≤7 days, 8–30 days, and ≥31 days).
Completeness: the complete recording of data elements by calculating the proportion of complete fields over total number of fieldsQuantitative method to audit data setUse a de-identified 757,476 demographic records and 2,634,101 vaccination records from the systemDescriptive statistics on attributesLarge dataset provides a statistically significant associationNot able to examine two highly relevant components of data quality: vaccination record coverage completeness and accuracyLash et al. 2012 [74]Completeness: the number of locations matching to latitude and longitude coordinates.
Positional accuracy: spatial resolution of the dataset. Concordance: the number of localities falling within the boundary. Repeatability: the georeferencing methodologyGeoreferencing historic datasets, quantitative method research historic data with 404 recorded MPX cases in seven countries during 1970–1986 from 231 unique localitiesDevelop ecological niche models and maps of potential MPX distributions based on each of the three occurrence data sets with different georeferencing effortsDescriptive statistics on attributes and comparison of georeferencing match ratesDocument the difficulties and limitations in the available methods for georeferencing with historic disease data in foreign locations with poor geographic reference information.Not able to examine the accuracy of data sourceLin et al. 2012 [65]Completeness: sufficient sample size. Accuracy: data missing or discrepancies between questionnaires and databaseQuantitative and qualitative methods, auditing data set by cross-checking 5% questionnaires against the electronic database during the field visitsReview guidelines and protocols using a detailed checklist; purposive sampling; direct observations of data collection; cross-checking compared database with the questionnairesDescriptive statistics for attributes of data qualityMixed-methods to assess data qualityUnable to generalize the findings to the whole systemLitow and Krahl 2007 [64]Accuracy, use of standards, completeness, timeliness, and accessibilityQuantitative method based on a framework developed for assessment of PHISExported and queried one year data by 12 data itemsDescriptive statistics for data quality attributesResearch on Navy population for public health applicability of the system and identified factors influencing data qualityNeeds a framework which was undefined in the researchLowrance et al. 
2007 [63]Completeness, updated-ness, accuracyQualitative method by following CDC’s Guidelines with qualitative methodsStandardized interviews with 18 key informants during 12 site visits, and meetings with stakeholders from government, non-governmental and faith-based organizations.Thematically grouping interview responsesData quality qualitatively assessed by key informants and stakeholdersLack of quantifiable informationAuthors YearAttributes Major measuresStudy designData collection methodsData analysis methodsContributionLimitationsMakombe et al. 2008 [49]Completeness: filled fields; accuracy: no missing examined variables or a difference less than 5% compared to the supervision reportQuantitative methods to audit the quality of site reports as of the date of field supervisory visits6 case registration fields and 2 outcome data were examinedDescriptive statistics on attributes of data quality from site reported were compared to those of supervision reports (“gold standard”)Set up thresholds of accuracy, examine association between facility characteristics and data qualityOnly assessed aggregated facility-level rather individual patient dataMate et al. 2009 [48]Completeness: no missing data in a period of time; accuracy: the value in the database was within 10% of the gold standard value or percentage deviation from expected for each data element when compared to the gold standard data setQuantitative methods to assess attributes. Completeness: surveying six data elements in one year dataset from all sample sites. Accuracy: surveying a random sample sites in three months to assess variation of three steps in data collection and reportingExtracted one year dataset for surveying data completeness of six data elements. Randomization sampling. Paralleled collection of raw data by on-site audit of the original data. Reconstructed an objective, quality-assured “gold standard” report dataset. 
All clinical sites were surveyed for data completeness, 99 sites were sampled for data accuracyDescriptive statistics, by using charts, average magnitude of deviation from expected, and data concordance analysis between reported data and reconstructed datasetLarge sample size, randomized sampling technique, the use of an objective, quality-assured “gold standard” report generated by on-site audit of the original data to evaluate the accuracy of data elements reported in the PHIS. Set up thresholds of accuracy and errorsSources of data were not verifiedMatheson et al. 2012 [71] *Missing data, invalid data, data cleaning, data management processesNot conductedN/AN/AN/ALack of specific metricsME DQA 2008 [34]Accuracy, reliability, precision, completeness, timeliness, integrity, confidentialityComprehensive audit in quantitative and qualitative methods including in-depth verifications at the service delivery sites; and follow-up verifications at the next level4 methods for selection of sites including purposive selection, restricted site design, stratified random sampling, random sampling; the time period corresponding to the most recent relevant reporting period for the IS. Five types of data verifications including description, documentation review, trace and verification (recount), cross-checks, spot-checks. 
Observation, interviews and conversations with key data quality officials were applied to collect dataDescriptive statistics on accuracy, availability, completeness, and timeliness of reported data, including results verification ratio of verification, percentage of each dimension, differences between cross-checkTwo protocols, 6 phases, 17 steps for the audit; sample on a limited scale considering the resources available to conduct the audit and level of precision desired; 2–4 indicators “case by case” purposive selection; on-site audit visits by tracing and verifying results from source documents at each level of the PHISConfined to specific disease context and standard program-level output indicatorsME PRISM 2010 [40]Relevance: comparing data collected against management information needs. Completeness: filling in all data elements in the form, the proportion of facilities reporting in an administrative area. Timeliness: submission of the reports by an accepted deadline. Accuracy: comparing data between facility records and reports, and between facility reports and administrative area databasesQuantitative method, Questionnaire survey including data completeness and transmission, data accuracy check, data processing and analysis, assess the respondent’s perceptions about the use of registers, data collection forms and information technologyNon-anonymous interviews with identified name and title, including asking, manual counting, observation and recording results or circling “yes or no”Using a data entry and analysis tool (DEAT), described in quantitative terms rather than qualitative. Yes or No tick checklistA diagnostic tool in forms measures strengths and weaknesses in three dimensions of data quality. Quantitative terms help set control limits and targets and monitor over timeIndicators are not all inclusive; tool should be adapted in a given context. Need pre-test and make adjustmentsPereira et al. 
2012 [72]Completeness and accuracy of data-fields and errorsQuantitative and qualitative methods: Use primary (multi-center randomized trial) and secondary (observational convenience sample) studiesField visits of a sample of clinics within each PHU to assess barcode readability, method efficiency and data quality. 64 clinic staff representing 65% of all inventory staff members in 19 of the 21 participating PHUs completed a survey examining method perceptionsDescriptive statistics: a weighted analysis method, histograms, 95% confidence intervals, F-test, Bootstrap method, the two-proportion z-test, adjusted the p values using Benjamin–Hochberg’s method for controlling false discovery rates (FDR)The first study of such in an immunization setting.Lack of representativeness to multiple lot numbers. Inaccurate data entry was not examined. Observations were based on a convenience samplePetter and Fruhling 2011 [62]Checklist of system quality, information qualityQuantitative methods to use DeLone&McLean IS success model. Use a survey in structured questionnaireOnline survey, facsimile, and mail, using 7 Likert scale for all quantitative questions. A response rate of 42.7% with representative demographicsSummative score for each construct, and each hypothesis was tested using simple regression. Mean, standard deviation, the Spearman’s correlation coefficients for analysisDemonstrates the need to consider the context of the medical information system when using frameworks to evaluate the systemInability of assessing some correlational factors due to the small PHIS user systemRonveaux et al. 2005 [60]Consistency
The ratio of verified indicators reported compared with written documentation at health facilities and districtsQuantitative methods, using standardized data quality audits (WHO DQAs) in 27 countriesRecounted data compared to reported dataDescriptive statisticsA quantitative indication of reporting consistency and quality, facilitate comparisons of results over time or placeSimilar to WHO DQASaeed et al. 2013 [61]Completeness, validity, data management Calculation of missing data and illegal values (out of a predetermined range), data management (data collection, entry, editing, analysis and feedback)Quantitative and qualitative methods, including interview, consultation, and documentation review10 key informants interview among the directors, managers and officers; 1 or 2 staff at national level interviewed; consultation with stakeholders, document review of each system strategic plan, guidelines, manuals, annual reports and data bases at national levelPredefined scoring criteria for attributes: poor, average, or goodComparison of two PHISPurposive samplingSavas et al. 2009 [47]Sensitivity, specificity and the Kappa coefficient for inter-rater agreementQuantitative methods: audit data set by cross-linkage techniquesDatabases were deterministically cross linked using female sex and social security numbers. Deterministic and probabilistic linkage methods were also comparedDescriptive statisticsCombined electronic databases provide nearly complete ascertainment for specific datasetUsing data which were missing would affect the results by under-ascertainmentVan Hest et al. 
2008 [46]Accuracy and completeness of reported casesQuantitative methods: audit data set by record-linkage and capture-recapture techniquesUse record linkage, false-positive records and correction, and capture-recapture analysis through 3 data sources by a core set of identifiersDescriptive statistics: number, proportion and distribution of cases, 95% ACI (Approximate confidence interval), Zelterman’s truncated modelRecord-linkage of TB data sources and cross-validation with additional TB related datasets improves data accuracy as well as completeness of case ascertainmentImperfect record-linkage and false-positive records, violation of the underlying capture–recapture assumptionsVenkatarao et al. 2012 [22]Timeliness: Percentage of the reports received on time every week; Completeness: percentage of the reporting units sending reports every weekQuantitative methods: Use field survey (questionnaire) with a 4-stage sampling method2 study instruments: the first focused on the components of disease surveillance; the second assessed the ability of the study subject in identifying cases through a syndromic approachDescriptive statistics analysisTwo instruments including surveying users and datasetNot able to assess the quality of data source such as accuracyWHO DQA 2003 [42]Completeness of reporting, report availability, timeliness of reporting, verification factorQuantitative methods to audit selected indicators in the dataset. 
Multi-stage sampling from stratified sample representing the country’s PHISRecounted data compared to reported dataDescriptive statisticsA systematic methodology to describe data quality in the collection, transmission and use of information, and to provide recommendations to address themSample size and the precision dictated by logistical and financial considerationsWHO DQRC 2013 [44]Completeness of reporting; internal consistency of reported data; external consistency of population data; external consistency of coverage ratesQuantitative method to conduct a desk review of available data and a data verification component at national level and sub-national levelAn accompanying Excel-based data quality assessment toolSimple descriptive statistics: percentage, standard deviationEasy to calculateNeeds WHO DQA to complement assessment of the quality of data sourceWHO HMN 2008 [45]Data-collection method, timeliness, periodicity, consistency, representativeness, disaggregation, confidentiality, data security, and data accessibility.Quantitative and qualitative methods to use 63 out of 197 questions among around 100 major stakeholdersUse consensus development method by group discussions, self-assessment approach, individual (less than 14) or group scoring to yield a percentage rating for each categoryAn overall score for each question, quartiles for the overall report.Expert panel discussion, operational indicators with quality assessment criteria.Sample size was dictated by logistical and financial considerationsOpen in a separate window
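The most common quantitative measures in Table A1 (completeness as the proportion of filled fields, timeliness as binned entry lag, and accuracy as sensitivity/specificity/PPV against a gold standard) are straightforward to compute. The sketch below is purely illustrative, not from any reviewed study; the record structure and field names are invented.

```python
# Illustrative sketch (invented data): computing completeness, timeliness
# (binned as in Hills et al. 2012), and accuracy against a gold standard
# (as in the record-linkage audits, e.g., Ford et al. 2007).
from datetime import date

records = [
    {"id": 1, "sex": "F", "dob": date(2010, 1, 2),
     "service_date": date(2012, 3, 1), "entry_date": date(2012, 3, 5)},
    {"id": 2, "sex": None, "dob": None,
     "service_date": date(2012, 3, 1), "entry_date": date(2012, 4, 20)},
]

def completeness(recs, fields):
    """Proportion of non-missing values over all checked fields."""
    filled = sum(r[f] is not None for r in recs for f in fields)
    return filled / (len(recs) * len(fields))

def timeliness_category(r):
    """Days from service to data entry, in three bins."""
    lag = (r["entry_date"] - r["service_date"]).days
    return "<=7" if lag <= 7 else ("8-30" if lag <= 30 else ">=31")

def accuracy_vs_gold(reported, gold):
    """Sensitivity, specificity, PPV of a binary indicator vs. gold standard."""
    tp = sum(r and g for r, g in zip(reported, gold))
    fp = sum(r and not g for r, g in zip(reported, gold))
    fn = sum((not r) and g for r, g in zip(reported, gold))
    tn = sum((not r) and (not g) for r, g in zip(reported, gold))
    return tp / (tp + fn), tn / (tn + fp), tp / (tp + fp)

print(completeness(records, ["sex", "dob"]))          # 0.5
print([timeliness_category(r) for r in records])      # ['<=7', '>=31']
print(accuracy_vs_gold([1, 1, 0, 0], [1, 0, 0, 1]))   # (0.5, 0.5, 0.5)
```

Real audits additionally attach confidence intervals to these proportions and compare datasets with a chi-square test, as several of the studies above do.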
Table A2
Freestone et al. 2012 [52]
- Attributes and measures: Trends in use; actioned requests from researchers in a set period of time
- Study design: Analysis of actioned requests from researchers over a period of time
- Data collection: Abstracted data from the database for the study period
- Data analysis: Trend analysis of the proportion of requests
- Contribution: Quantifiable measures
- Limitations: Limited attributes

Hahn et al. 2013 [50]
- Attributes and measures: Use of data: the usage of aggregated data for monitoring, information processing, finance and accounting, and long-term business decisions
- Study design: Qualitative methods: structured interviews with a purposive sample of 44 staff and in-depth interviews with 15 key informants
- Data collection: Structured survey and key informant interviews assessing five structured statements, with five-point scales for each statement
- Data analysis: Responses were processed manually, classified and grouped by facility and staff class
- Contribution: Identified indicators of use of data
- Limitations: Lack of quantifiable results for assessment of data use

Iguiñiz-Romero and Palomino 2012 [70]
- Attributes and measures: Data use and data dissemination: whether data are used for decision making; the availability of feedback mechanisms
- Study design: Qualitative exploratory study including interviews and documentation review
- Data collection: Open-ended, semi-structured questionnaire interviews with 15 key decision-makers; review of national documents and academic publications
- Data analysis: Interview data recorded, transcribed, and organized thematically and chronologically; respondents identified by position but not name
- Contribution: Most respondents held key positions, and the reviewed publications covered a long period
- Limitations: Purposive sample; lack of representativeness

Matheson et al. 2012 [71]
- Attributes and measures: Clinical use of data: the number of summaries produced. Use of data for local activities to improve care. Data entry: the number of active sites. Report use: the percentage of active sites using prebuilt queries to produce data for each type of report in a given month over time
- Study design: Qualitative and quantitative methods: key informant interviews, documentation review, database queries
- Data collection: Personal interviews by phone and internet telephony, with follow-up in person or by email; SQL queries run against the central database; external events identified by reviewing news reports and through the authors' personal knowledge
- Data analysis: Descriptive statistics, using charts of the number of clinics using the system in a given month and the percentage of active clinics
- Contribution: Multiple methods
- Limitations: Lack of verification of the data source

ME PRISM 2010 [40]
- Attributes and measures: Checklist of use of information: report production, display of information, discussion and decisions about use of information, and promotion and use of information at each level
- Study design: Quantitative method to complete a predesigned checklist diagnostic tool
- Data collection: Checklist and non-anonymous staff interviews: asking, manual counting, observation, and recording results or circling "yes or no"
- Data analysis: Two Likert scores and descriptive statistics
- Contribution: Quantitative terms help set control limits and targets and monitor over time

Petter and Fruhling 2011 [62]
- Attributes: System use, intention to use, user satisfaction
- Study design: Quantitative methods using the DeLone & McLean IS success model; survey respondents with a response rate of 42.7% and representative demographics
- Data collection: An online survey with a structured questionnaire using a 7-point Likert scale for all quantitative questions, in addition to facsimile and mail
- Data analysis: Summative score for each construct; each hypothesis tested using simple regression, in addition to mean, standard deviation, and Spearman's correlation coefficients
- Contribution: Use is dictated by factors outside the user's control and is not a reasonable measure of IS success; quality does not affect the depth of use
- Limitations: Lack of objective assessments

Qazi and Ali 2011 [27]
- Attributes and measures: Use of data: non-use, misuse, disuse of data
- Study design: Descriptive qualitative interviews
- Data collection: In-depth, face-to-face, semi-structured interviews with an interview guide; 26 managers (all men, ages 26 to 49, selected from the federal level (2), provincial level (4) and seven selected districts (20) across all four provinces)
- Data analysis: Data transcription; analysis based on categorization of verbatim notes into themes and a general description of the experience that emerged from the statements
- Contribution: A qualitative study allows getting close to the people and situations being studied; identified a number of hurdles to use of data
- Limitations: Convenience sample; only one type of stakeholder was covered

Saeed et al. 2013
- Attributes and measures: Usefulness of the system: data linked to action, feedback at lower levels, data used for planning, outbreak detection, and data used for the development and conduct of studies
- Study design: Quantitative and qualitative methods, including interviews, consultation, and documentation review
- Data collection: Interviews with 10 key informants; consultation with stakeholders; document review of each system
- Data analysis: Predefined scoring criteria for attributes: poor, average, or good
- Contribution: Mixed methods
- Limitations: Purposive sampling

WHO HMN 2008 [45]
- Attributes: Information dissemination and use, demand and analysis, policy and advocacy, planning and priority-setting, resource allocation, implementation and action
- Study design: Mixed methods: quantitative and qualitative; 10 of 197 questions among stakeholders at national and subnational levels
- Data collection: Group discussions (100 major stakeholders); self-assessment approach; individual (fewer than 14) or group scoring to yield a percentage rating for each category
- Data analysis: An overall score for each question; quartiles for the overall report
- Contribution: Expert panel discussion; operational indicators with quality assessment criteria
- Limitations: Lack of field verification of data use

Wilkinson and McCarthy
- Attributes: Extent of data recognition and use, strategies and routines, specific uses, dissemination
- Study design: Quantitative and qualitative methods using standardized semi-structured telephone questionnaire interviews of key informants from the systems' management teams
- Data collection: Telephone structured questionnaire interviews of 68 key informants from 29 of the 34 management teams of the networks; response options for most questionnaire items were yes/no or five- or seven-point Likert and semantic differential response scales
- Data analysis: Quantitative and qualitative analysis of survey results; qualitative data transcribed, ordered by question number, grouped into common themes, then content-analyzed to yield frequencies and percentages; correlational analyses used Pearson's r for parametric data and Spearman's rho for non-parametric data
- Contribution: Quantification of qualitative data
- Limitations: Statistical analysis limited by sample size (only 29 networks and 68 individual participants); statistical power to detect an effect is weak, so mainly general trends are reported
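Most data-use attributes in Table A2 are qualitative, but a few are directly computable, such as the "report use" measure of Matheson et al. 2012 (the percentage of active sites producing a report in a given month). The sketch below is a hypothetical illustration of that measure; the log structure and site names are invented.

```python
# Illustrative sketch (invented data): percentage of active sites producing
# a given report type per month, in the style of Matheson et al. 2012.
from collections import defaultdict

# (site, month, produced_report) tuples; all values are hypothetical.
log = [
    ("site_a", "2012-01", True),
    ("site_b", "2012-01", False),
    ("site_a", "2012-02", True),
    ("site_b", "2012-02", True),
]

def report_use_by_month(entries):
    """Percentage of active sites producing the report, per month."""
    used, active = defaultdict(set), defaultdict(set)
    for site, month, produced in entries:
        active[month].add(site)   # any log entry marks the site as active
        if produced:
            used[month].add(site)
    return {m: 100.0 * len(used[m]) / len(active[m]) for m in sorted(active)}

print(report_use_by_month(log))  # {'2012-01': 50.0, '2012-02': 100.0}
```

Plotting this series over time gives the trend charts the study used to relate system use to external events.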
Table A3
Ancker et al. 2011 [59]
- Attributes and measures: Group discussion about root causes of poor data quality and strategies for solving the problems
- Study design: Qualitative method by focus group discussion
- Data collection: A series of weekly team meetings over about 4 months with key informants involved in the data collection
- Data analysis: Theme grouping for each data quality issue
- Contribution: Initiated by, and related to, identified poor data quality issues
- Limitations: Implicitly focused; only analyzed causes, did not assess magnitude

Bosch-Capblanch et al. 2009 [58]
- Attributes and measures: Quality scores: recording and reporting of data, keeping of vaccine ledgers, and information system design
- Study design: Quantitative method by user survey based on the WHO DQA; a multistage weighted representative sampling procedure
- Data collection: Questionnaire based on a series of 19 questions and observations undertaken at each level (national, district and health unit)
- Data analysis: One point per question; average score, summary score, medians, inter-quartile ranges, confidence intervals, p-value, bubble scatter chart, rho value
- Contribution: Combined with data quality
- Limitations: Implicitly focused; the number of questions surveyed was smaller than that of the WHO DQA

CIHI 2009 [30]
- Attributes and measures: Metadata documentation: data holding description, methodology, data collection and capture, data processing, data analysis and dissemination, data storage, and documentation
- Study design: Quantitative method by surveying users
- Data collection: Questionnaire
- Data analysis: Undefined
- Contribution: Seven categories, with subcategories and definitions and/or examples
- Limitations: Implicitly focused

Corriols et al. 2008 [55]
- Attributes and measures: Identification of underreporting reasons by reviewing the information flow chart and non-reporting among physicians
- Study design: Qualitative method to review documentation
- Data collection: Review of the national reports on the system related to deficiencies in the information flow chart and non-reporting among physicians
- Data analysis: Undefined
- Contribution: Initiated by identified data quality issues
- Limitations: Implicitly focused

Dai et al. 2011 [69]
- Attributes: Data collection, data quality management, statistical analysis and data dissemination
- Study design: Qualitative method, documentation review
- Data collection: Document review
- Data analysis: Theme grouping
- Contribution: Desk review
- Limitations: Implicitly focused

Forster et al. 2008
- Attributes: Routine data collection, training and data quality control
- Study design: Quantitative method by online survey
- Data collection: Questionnaire
- Data analysis: Descriptive statistics
- Contribution: Examined associations between site characteristics and data quality
- Limitations: Implicitly focused; convenience sample

Freestone et al. 2012 [52]
- Attributes: Data collection and recording processes
- Study design: Qualitative method to review current processes for identification, coding and geocoding of address or location data; staff consulted to establish and observe coder activities and entry processes
- Data collection: Review of the processes; consultation with staff; observation of coder activities and entry processes to identify potential causes of errors, which were then grouped thematically
- Data analysis: Thematic grouping of data
- Contribution: Identified that each of the key elements of the geocoding process is a factor affecting geocoding quality
- Limitations: Differences in software and system settings need to be kept in mind

Hahn et al. 2013 [50]
- Attributes and measures: Data flow: the generation and transmission of health information
- Study design: Qualitative method using workplace walkthroughs on 5 subsequent working days at each site
- Data collection: Informal observations of the generation and transmission of health information of all kinds for the selection of data flows
- Data analysis: Undefined
- Contribution: Observation through walkthroughs
- Limitations: Undefined indicators

Iguiñiz-Romero and Palomino 2012 [70]
- Attributes and measures: Data flow or data collection process: data collectors, frequencies, data flow, data processing and sharing
- Study design: Qualitative exploratory study including interviews and documentation review
- Data collection: Open-ended, semi-structured questionnaire interviews with 15 key decision-makers; review of national documents and academic publications
- Data analysis: Data recorded, transcribed, organized thematically and chronologically
- Contribution: Most respondents held key positions, and the reviewed publications covered a long period
- Limitations: Purposive sample

Lin et al. 2012 [65]
- Attributes: Data collection and reporting
- Study design: Qualitative methods based on CDC's Guidelines
- Data collection: Review of guidelines and protocols using a detailed checklist; direct observation; focus group discussions and semi-structured interviews
- Data analysis: Theme grouping
- Contribution: Field visits and observations of data collection to identify impacts on data quality
- Limitations: Undefined indicators

ME DQA 2008 [34]
- Attributes: Five functional areas: M&E structures, functions and capabilities; indicator definitions and reporting guidelines; data collection and reporting forms and tools; data management processes; and links with national reporting systems
- Study design: Quantitative and qualitative methods using 13 system assessment summary questions based on 39 questions from the five functional areas; scoring of the system combined with a comprehensive audit of data quality
- Data collection: Off-site desk review of documentation provided by the program/project; on-site follow-up assessments at each level of the IS, including observation, interviews, and consultations with key informants
- Data analysis: Summary statistics based on the judgment of the audit team; three-point Likert scale for each response; average scores per site on a 0–3 continuous scale
- Contribution: DQA protocol and system assessment protocol
- Limitations: Implicitly focused; the scores should be interpreted within the context of the interviews, documentation reviews, data verifications and observations made during the assessment

ME PRISM 2010 [40]
- Attributes and measures: Processes: data collection, transmission, processing, analysis, display, quality checking, feedback
- Study design: Quantitative method by questionnaire survey covering data transmission, quality checking, processing and analysis, and assessing respondents' perceptions of the use of registers, data collection forms and information technology
- Data collection: Non-anonymous staff interviews with identified name and title, including asking, observation and circling "yes or no"
- Data analysis: A data entry and analysis tool (DEAT), described in quantitative rather than qualitative terms; yes/no tick checklist
- Contribution: A diagnostic tool; quantitative terms help set control limits and targets and monitor over time
- Limitations: Indicators are not all-inclusive; the tool should be adapted, pre-tested and adjusted

Ronveaux et al. 2005 [60]
- Attributes and measures: Quality index (QI): recording practices, storing/reporting practices, monitoring and evaluation, denominators used at district and national levels, and system design at the national level
- Study design: Quantitative and qualitative methods by external on-site evaluation after multi-stage sampling based on the WHO DQA
- Data collection: Questionnaires and observations; surveys at the national level (53 questions), district level (38 questions) and health-unit level (31 questions); observation of workers at the health-unit level, who were asked to complete 20 hypothetical practices
- Data analysis: Descriptive statistics (aggregated scores, mean scores), with 1 point for each question or task observed; correlational analyses by zero-order Pearson correlation coefficients
- Limitations: Implicitly focused; the chosen sample size and the precision of the results were dictated by logistical and financial considerations

Venkatarao et al. 2012 [22]
- Attributes: Accuracy of case detection, data recording, data compilation, data transmission
- Study design: Quantitative method using a 4-stage sampling method to conduct a field survey (questionnaire) during May–June 2005 among 178 subjects
- Data collection: Questionnaires with two study instruments: the first focused on the components of disease surveillance; the second assessed the ability of the study subjects to identify cases through a syndromic approach
- Data analysis: Descriptive statistical analysis
- Contribution: Assessment from the user's viewpoint
- Limitations: Implicitly focused; lack of field verification of the data collection process

WHO DQA 2003 [42]
- Attributes and measures: Quality questions checklist, quality index; five components: recording practices, storing/reporting practices, monitoring and evaluation, denominators, and system design (the receipt, processing, storage and tabulation of the reported data)
- Study design: Quantitative and qualitative method using questionnaire checklists for each level of the system (national, district, health unit), with 45, 38 and 31 questions respectively
- Data collection: Questionnaires and discussions; observations by walking around the health unit to validate the reported values in the field
- Data analysis: Percentage of items answered "yes"; the target is 100% for each component
- Contribution: Describes the quality of data collection and transmission
- Limitations: Implicitly focused; the chosen sample size was dictated by logistical and financial considerations

WHO HMN 2008 [45]
- Attributes and measures: Data management or metadata: a written set of procedures for data management including data collection, storage, cleaning, quality control, analysis and presentation for users; an integrated data warehouse; a metadata dictionary; availability of unique identifier codes
- Study design: Mixed methods: quantitative and qualitative; 5 of 197 questions, at various national and subnational levels
- Data collection: Group discussions of around 100 major stakeholders; self-assessment approach; individual (fewer than 14) or group scoring to yield a percentage rating for each category
- Data analysis: An overall score for each question; quartiles for the overall report
- Contribution: Expert panel discussion; operational indicators with quality assessment criteria
- Limitations: Lack of field verification of the data collection process
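Several of the Table A3 instruments turn process checklists into scores: the WHO DQA reports the percentage of "yes" answers per component (target 100%), and the ME DQA averages three-point Likert responses per site on a 0–3 scale. A minimal sketch of both scoring rules, with invented responses:

```python
# Illustrative sketch (invented responses): WHO DQA-style and ME DQA-style
# scoring of a data collection process checklist.
def pct_yes(answers):
    """WHO DQA-style score: share of checklist items answered 'yes'."""
    return 100.0 * sum(a == "yes" for a in answers) / len(answers)

def mean_score(likert):
    """ME DQA-style score: mean of 0-3 Likert responses for one site."""
    return sum(likert) / len(likert)

components = {
    "recording practices": ["yes", "yes", "no"],
    "storing/reporting": ["yes", "no", "no"],
}
for name, answers in components.items():
    print(name, round(pct_yes(answers), 1))  # 66.7 and 33.3

print(mean_score([3, 2, 2, 1]))  # 2.0
```

As the ME DQA itself cautions, such scores are only meaningful alongside the interviews, documentation reviews and observations from which they were derived.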
Author Contributions
PY conceptualized the study. HC developed the conceptual framework with the guidance of PY, and carried out the design of the study with all co-authors. HC collected the data, performed the data analysis and appraised all included papers as part of her PhD studies. PY reviewed the papers included and the data extracted. PY, DH and NW discussed the study; all participated in the synthesis processes. HC drafted the first manuscript. All authors made intellectual input through critical revision to the manuscript. All authors read and approved the final manuscript.
Conflicts of Interest
The authors declare no conflict of interest.