Comparison between different instruments for measuring health-related quality of life in a population sample, the WHO MONICA Project, Gothenburg, Sweden: an observational, cross-sectional study
Health-related quality of life (HRQoL) measurements are frequently asked for in clinical trials and controlled studies. Very few comparative, methodological studies of HRQoL instruments have been published on population samples. This study reports the results of three different HRQoL instruments from a general population sample of over 400 Swedish subjects with a largely complete data set.
The general aim was to meet the need for empirical comparative studies of HRQoL assessment instruments, by evaluating and comparing the psychometric properties and results of three different, widely used, generic HRQoL instruments in a population sample. A specific aim was to evaluate the subscales of the different instruments that measure the same HRQoL domain. The hypothesis was that there would be a high concordance between similar subscales in the different instruments. Another specific aim was to assess the association between the HRQoL instruments and an easily administered single-item self-rated health scale. The hypothesis was that the self-rated health scale is strongly associated with all the domains of HRQoL.
Studies done in Dutch population samples in 1996 8 and 1997 9 and in a Brazilian population sample in 2011 10 are examples of studies that have applied different HRQoL instruments. All aimed to compare the reliability of scores, to assess the discriminative ability of potential outcome measures applied in a general population sample and to assess the extent of agreement between the different instruments. The authors concluded that it is important to define one’s research question and underlined the need for careful consideration when choosing among HRQoL instruments. However, this is difficult when head-to-head analyses of different instruments with overlapping purposes are so rare. The HRQoL instruments compared in this study are the Nottingham Health Profile (NHP), the Psychological General Well-Being Index (PGWB) and the Medical Outcomes Study Short Form-36 (SF-36). All of the instruments reflect the HRQoL domains outlined above.
There is a growing number of HRQoL measurement instruments available to researchers, and their sophistication, variety and scope are increasing. Since comparisons between clinical groups and population samples are common, it is important that the HRQoL instruments used are reliable and valid in the population. However, few studies apply different instruments and compare the results, and even fewer do so in general population samples. A meta-analysis planned by Lorente et al aims to evaluate HRQoL instruments, indicating the need for such comparisons. 7
HRQoL is, by nature, subjective, and a multidimensional approach must be taken to encompass physical and occupational function, psychological state, social interaction and the somatic sensations that an illness and its consequent therapy cause in a patient. 6 HRQoL instruments are generally used to quantify health along dimensions, or domains, such as mobility, ability to perform certain activities, emotional state, sensory function, cognition, social function and freedom from pain. 1
Health-related quality of life (HRQoL) is an important variable in clinical practice and in the medical literature with significant consequences for patients and society. As the general population ages and treatments become more advanced, widespread and expensive, interest has grown in evaluating medical treatments using patient-reported outcome measures, such as self-assessed HRQoL, as key variables. 1 2 HRQoL has become an integral part of medical clinical research in all disciplines, and is even seen as a hard end-point, alongside survival. 3 4 However, a major challenge has been to find widely accepted definitions of HRQoL. 5
In order to compare the results between the instruments NHP, PGWB, SF-36 and the self-rated health scale, the authors identified six domains that were conceptually similar: social functioning, pain, physical functioning, mental health, vitality and general health, and the summary scores. This categorisation was made based on the content in the items themselves and supported by previously published studies using these instruments 10 24–27 ( table 1 ).
Missing values were imputed in the NHP questionnaire if less than 80% of the values were missing in a given subscale. In these 20 instances, the median value was calculated and imputed. Imputation was considered unnecessary when analysing the PGWB and the self-rated health scale because the sample size was large and missing answers were uncommon.
All statistical analyses were performed using Statistical Package for the Social Sciences (V.24) software or Microsoft Excel. A p value of <0.01 was chosen to reduce the risk of type I error. SF-36 scores were calculated using scoring software obtained from Optum (license number QM03712), and mental and physical component scores were calculated using 1998 US norms. NHP scores were reversed for consistency with the other instruments to facilitate comparisons.
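The reversal of NHP scores can be illustrated with a minimal sketch. It assumes the reversal is a simple reflection of the 0–100 dimension score (100 minus the score); the text does not state the exact formula, and the function name is hypothetical.

```python
def reverse_nhp_score(score: float) -> float:
    """Reflect an NHP dimension score so that higher means better HRQoL.

    Assumption: reversal is computed as 100 - score. The NHP dimension
    scores range from 0 (no reported problems) to 100 (maximal distress).
    """
    if not 0 <= score <= 100:
        raise ValueError("NHP dimension scores must lie between 0 and 100")
    return 100 - score


# Illustrative values only, not study data.
print(reverse_nhp_score(0))     # a respondent with no reported problems -> 100
print(reverse_nhp_score(37.5))  # -> 62.5
```

After such a reflection, a respondent with no reported problems scores 100, so an NHP floor of problems becomes a ceiling of health, in line with the direction of the PGWB and SF-36 scores.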
Descriptive statistics for each of the instruments’ subscales, including mean, median, SD and percentage of subjects with the lowest (floor effect) and highest (ceiling effect) possible scores, were calculated. The non-parametric Mann-Whitney U test was used to compare continuous variables, since the results were not normally distributed. The standardized mean effect size was calculated using Cohen’s d (mean difference divided by the pooled standard deviation); d>0.25 was considered educationally significant and d>0.5 was considered clinically significant. 23 Internal consistency was examined using Cronbach’s α; α>0.70 was considered acceptable. Correlation analyses between the instruments focused on comparing the conceptually similar dimensions. Spearman’s rho correlations (rs) were used to analyse discriminant validity since the results were not normally distributed. Correlation coefficients were considered weak if rs<0.30, moderate if rs=0.30–0.49 and strong if rs≥0.50. The R² coefficient of determination from regression analysis was also calculated for certain subscale comparisons. The presence of self-rated ill-health was defined using the self-rated health scale score split at the median. All scores below the median value were categorized as self-rated ill-health.
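As a rough illustration of two of the statistics named above, the following sketch computes Cohen's d with the conventional pooled standard deviation and Cronbach's α from item-level scores. The study itself used SPSS and Excel; the function names and the example data here are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Standardised mean difference: mean difference / pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


def cronbach_alpha(items):
    """Cronbach's alpha.

    items: one list per questionnaire item, each holding that item's
    score for every respondent (same respondent order in every list).
    """
    k = len(items)
    totals = [sum(responses) for responses in zip(*items)]
    item_variance_sum = sum(variance(item) for item in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Illustrative values only, not study data.
d = cohens_d([2, 4, 6], [1, 3, 5])                   # 0.5
alpha = cronbach_alpha([[1, 2, 3, 4], [2, 3, 4, 5]])  # 1.0 (items move together)
print(round(d, 2), round(alpha, 2))
```

Both functions use sample variances (n − 1 in the denominator), which is the usual convention; other software may differ slightly in small samples.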
Age in whole years and sex were determined using the Swedish personal identity number on the day of the visit. Information about education level was recorded in whole years from the first grade, according to the subject.
Self-rated health was measured with a single question. Subjects were asked to rate their current health status between 0 and 100 on a linear analogue self-assessment scale, 0 being the worst conceivable level and 100 being the best conceivable level. The item is identical to question number six published in the 1990 edition of the EuroQoL-5 Dimension questionnaire, EQ-5D. 19 Such single-item health indicators have consistently been shown to be strong correlates of objective health and even predictors of mortality. 20–22
The SF-36 is a multipurpose health survey comprised of 36 items where a high score represents a better HRQoL. 17 It yields an eight-scale profile of functional health and well-being: physical functioning, role physical, bodily pain, general health, vitality, social functioning, role emotional and mental health (range for all 0–100). It also generates psychometrically based physical and mental health summary measures: a mental component summary and a physical component summary. The mental component summary is comprised of the subscales for vitality, social functioning, role emotional and mental health, whereas the physical component summary is comprised of the subscales for physical functioning, role physical, bodily pain and general health. The SF-36 has been proven useful in surveys of general and specific populations, comparing the relative burden of diseases, and in differentiating the health benefits produced by a wide range of different treatments. 18
The PGWB was designed to measure personal affective or emotional states reflecting a sense of well-being or distress and is intended for use in community surveys. 15 The PGWB includes 22 items, with a six-grade Likert-style response format where a high score represents a better HRQoL. The scores are summarised into an overall well-being score (PGWB total score, range 22–132) and are also divided into six subscales: anxiety (range 5–30), depressed mood (range 3–18), positive well-being (range 4–24), self-control (range 3–18), general health (range 3–18) and vitality (range 4–24). The PGWB has been used in clinical trials and has performed well in both population-based and mental health samples. 16
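The quoted ranges are consistent with each of the 22 items being scored from 1 to 6. The sketch below checks that arithmetic; the per-subscale item counts are inferred from the lower bounds of the ranges given above and should be read as an assumption rather than as the official scoring key.

```python
# Item counts inferred from the quoted ranges (assumption): a subscale with
# n items scored 1-6 has a minimum of n and a maximum of 6 * n.
PGWB_ITEM_COUNTS = {
    "anxiety": 5,
    "depressed mood": 3,
    "positive well-being": 4,
    "self-control": 3,
    "general health": 3,
    "vitality": 4,
}
MIN_ITEM_SCORE, MAX_ITEM_SCORE = 1, 6


def score_range(n_items):
    """Lowest and highest possible summed score for n_items items."""
    return n_items * MIN_ITEM_SCORE, n_items * MAX_ITEM_SCORE


subscale_ranges = {name: score_range(n) for name, n in PGWB_ITEM_COUNTS.items()}
total_items = sum(PGWB_ITEM_COUNTS.values())
total_range = score_range(total_items)

print(subscale_ranges["anxiety"])  # (5, 30)
print(total_items, total_range)    # 22 (22, 132)
```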
The NHP measures aspects of subjective health using a two-part questionnaire. 13 In this study, the NHP part I was used. Part I is comprised of 38 statements covering six dimensions concerning distress or limitations of activity: physical mobility, pain, sleep, energy, social isolation and emotional reactions. The response format is yes or no; dimension scores range from 0 to 100 and each statement is weighted according to the level of severity. The higher the score, the greater the limitations/distress, that is, the lower the HRQoL. The NHP was developed in the 1980s but is still widely used, especially in Europe. It is useful because of its breadth and simplicity and is a suitable instrument for use in clinical practice and in populations where there are likely to be people with disabilities. 14
The subjects completed the questionnaires while visiting the Sahlgrenska University Hospital, Gothenburg, for medical examinations. After blood sampling, all subjects received breakfast during which the questionnaires were administered in the following order: NHP, PGWB, SF-36 and a single-item self-rated health scale. A single operator performed the measurements and administrations on all subjects. No personal guidance was given except for the instructions.
In 1995, 2592 individuals (age 25–64, 50% women) were recruited from the Gothenburg city census, which is kept up to date within a maximum of 14 days. This was the third population screening by the WHO MONICA-GOT (WHO MONItoring of trends and determinants for CArdiovascular disease GOThenburg) in which 1618 individuals participated. 11 The non-attenders in 1995 could not participate due to travel, living abroad, unwillingness to attend or inability to attend due to the illness of a relative. The subjects were examined at a medical clinic. A randomly selected subset of these subjects (every fourth subject, and all of the women aged 45–64 years, in total 662) underwent extra testing and they were invited for re-evaluation and assessment of HRQoL in 2008. 12 Of these subjects, 495 responded, 97 were deceased, 13 could not be traced and 57 did not reply. Sixty-four declined consent to participate and 17 did not come to the clinic. In total, 414 subjects completed the HRQoL questionnaires. Two subjects were excluded because of incomplete data, leaving 412 subjects who were included in the analysis (62% participation rate, 77% women, age range 39–78 years).
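The participant flow reported above can be verified with simple arithmetic. The sketch below uses only the counts given in the text; the variable names are chosen here for illustration.

```python
# Counts taken from the text for the 2008 re-evaluation.
invited = 662
responded, deceased, untraceable, no_reply = 495, 97, 13, 57
declined_consent, did_not_attend = 64, 17
excluded_incomplete = 2

# Every invited subject falls into exactly one of the four outcome groups.
assert responded + deceased + untraceable + no_reply == invited

completed = responded - declined_consent - did_not_attend
analysed = completed - excluded_incomplete
participation_rate = analysed / invited

print(completed, analysed, round(100 * participation_rate))  # 414 412 62
```

The 62% participation rate therefore uses all 662 invited subjects as the denominator, including those who were deceased or could not be traced.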
To compare the ability of the PGWB, the NHP and the SF-36 instruments to discriminate subjects on the basis of health, the presence of ill-health was defined as self-rated health scale <80 (median score). All of the subscales could significantly differentiate the presence of self-perceived ill-health (p<0.001) and the effect sizes for all subscales were above the threshold to be considered clinically significant (Cohen’s d>0.5) ( table 4 ).
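The median split used to define ill-health can be sketched as follows. The scores are invented for illustration and are not study data; the function name is hypothetical.

```python
from statistics import median


def flag_ill_health(self_rated_scores):
    """Return the median cut-off and a flag per subject.

    A subject is flagged as having self-rated ill-health when the score
    lies strictly below the sample median, as described in the text.
    """
    cutoff = median(self_rated_scores)
    return cutoff, [score < cutoff for score in self_rated_scores]


# Illustrative values only, not study data.
scores = [55, 70, 80, 80, 85, 90, 100]
cutoff, ill_health = flag_ill_health(scores)
print(cutoff)      # 80
print(ill_health)  # [True, True, False, False, False, False, False]
```

Because subjects scoring exactly at the median are placed in the "good health" group, the two groups need not be of equal size when many respondents share the median value.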
Scatterplot diagrams were used to examine and visualise the relationships between some of the subscales ( figure 3 ). The social functioning domain ( figure 3A ) showed an R² coefficient of 0.18 for the NHP versus the SF-36, meaning that only approximately 18% of the variation in social functioning measured with the NHP is explained by variation in the same dimension measured with the SF-36. The correlation between the PGWB total score and the SF-36 mental component summary was strong and the linear relationship was the strongest of all the comparisons tested, with R²=0.65 ( figure 3B ). The general health domain also showed strong correlations between all three instruments, with the highest R² coefficient between the self-rated health scale and the SF-36 general health (R²=0.58) ( figure 3C–E ).
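For a simple linear regression with one predictor and an intercept, the coefficient of determination equals the squared Pearson correlation. The sketch below computes it directly, assuming that this is the kind of regression behind the reported values (the text does not specify the model); the data are invented for illustration.

```python
from math import sqrt
from statistics import mean


def r_squared(x, y):
    """R^2 of a simple linear regression of y on x (squared Pearson r)."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy ** 2 / (s_xx * s_yy)


# Illustrative values only, not study data.
print(round(r_squared([1, 2, 3, 4], [1, 3, 2, 4]), 2))  # 0.64

# Under the same assumption, a reported R^2 implies this absolute Pearson r:
for reported in (0.18, 0.58, 0.65):
    print(reported, round(sqrt(reported), 2))  # 0.42, 0.76, 0.81
```

Note that these implied values are Pearson correlations and need not match the Spearman coefficients reported for the same subscale pairs.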
Correlations between the self-rated health scale and the general health subscales in the PGWB and the SF-36 were strong. The associations between the self-rated health scale and the PGWB total score and the SF-36 physical component summary and mental component summary were also strong. It is notable that there were no weak correlations between the self-rated health scale and any of the other instruments’ subscales ( table 3 ).
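The rank correlation and the strength categories used in this study can be sketched as follows. Ties receive average ranks, which matters for scales with many respondents at the ceiling. Applying the thresholds to the absolute value of the coefficient is an assumption made here; the function names and data are hypothetical.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if x or y is constant."""
    mean_x, mean_y = mean(x), mean(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


def spearman_rho(x, y):
    """Spearman's rho as the Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def strength(rs):
    """Categories from the text: weak <0.30, moderate 0.30-0.49, strong >=0.50."""
    size = abs(rs)  # assumption: thresholds apply to the magnitude
    if size < 0.30:
        return "weak"
    if size < 0.50:
        return "moderate"
    return "strong"


# Illustrative values only, not study data.
print(average_ranks([10, 20, 20, 30]))  # [1.0, 2.5, 2.5, 4.0]
print(strength(spearman_rho([1, 2, 3, 4], [10, 20, 30, 40])))  # strong
```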
The correlations were positive and strong in the subscales between the PGWB, SF-36 and NHP, respectively, in the domains they had in common (mental health, p<0.01; vitality, p<0.01). The PGWB subscales were more strongly associated with the SF-36 subscales than with the NHP subscales. Furthermore, the associations between the PGWB and the SF-36 were stronger than the associations between SF-36 and NHP within these domains. The PGWB total score was associated with both the SF-36 summary scores but the association was weaker with the physical component summary than with the mental component summary.
The results found in the comparable dimensions of the SF-36 and NHP are shown in figure 2 . There were positive correlations between all the similar subscales of the SF-36 and NHP (physical functioning, pain, vitality, social functioning and mental health, all p<0.01) ( table 3 ).
Internal consistency coefficients for all instruments are shown in table 2 . The NHP yielded lower internal consistency estimates than the other two instruments (NHP mean α=0.77, range 0.66–0.87; PGWB mean α=0.85, range 0.76–0.90; SF-36 mean α=0.86, range 0.83–0.91). Two of the subscales in the NHP fell below the standard recommended α>0.70 for group comparisons (social isolation and sleep). All of the eight SF-36 subscales and four of the six subscales in the PGWB had α-coefficients >0.80.
The distribution of the results was skewed for all of the instruments ( figure 1 ). The ceiling effect was most prominent in the NHP, in which 43%–84% of the respondents scored at the ceiling in the different subscales. The highest proportion of respondents scoring at the ceiling in the NHP subscales was in the subscales social isolation (84%), energy (70%) and physical mobility (66%). The highest ceiling effects in the SF-36 were seen in the subscales role emotional (69%), role physical (60%) and social functioning (59%). The highest proportion of ceiling scores in PGWB was seen in the subscale depressed mood (42%). The self-rated health scale was the least skewed of all the instruments used and only 5.3% reported the highest possible score of 100 ( table 2 ).
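Floor and ceiling effects are simply the percentages of respondents at the lowest and highest possible scores of a scale. A minimal sketch, with invented scores and a hypothetical function name:

```python
def floor_ceiling_pct(scores, lowest, highest):
    """Percentage of respondents at the lowest and at the highest possible score."""
    n = len(scores)
    floor_pct = 100 * sum(score == lowest for score in scores) / n
    ceiling_pct = 100 * sum(score == highest for score in scores) / n
    return floor_pct, ceiling_pct


# Illustrative values only, not study data: a 0-100 subscale.
scores = [100, 100, 75, 50, 100, 0, 100, 25]
print(floor_ceiling_pct(scores, lowest=0, highest=100))  # (12.5, 50.0)
```

The scale limits are passed explicitly because the instruments differ: the SF-36 and reversed NHP subscales run from 0 to 100, whereas the PGWB subscales have their own ranges.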
Descriptive statistics for each of the HRQoL instruments are presented for the whole group in table 2 . Men and women scored similarly in all the NHP subscales and the self-rated health scale. There were statistically significant differences between the sexes in some of the PGWB and SF-36 subscales, but further analysis to determine the effect size showed none of these differences to be of clinical significance (Cohen’s d range 0.2–0.4) (data not shown).
The mean age of the subjects who were included in the analysis (n=412) was 62.8 years, range 39–78. Seventy-seven per cent were women, with a mean age of 63.7 years; the men had a mean age of 59.6 years (p<0.001). The average number of school years was 12, with no significant difference between men and women (data not shown). Most of the subjects (>90%) had been employed but were retired at the time of this investigation.
Discussion
The general aim of the study was to examine and compare the psychometric properties of three generic HRQoL instruments (the NHP, the SF-36 and the PGWB) and their association with the self-rated health scale when used in a general population sample. The instruments showed strong reliability and discriminative ability, and the subscales measuring the same HRQoL domain showed strong associations (mainly rs>0.60), except in the social functioning domain. The distributions were skewed with considerable ceiling effects, which is to be expected when measuring HRQoL in a general population sample. All instruments differentiated between individuals with poor and good health.
It is widely accepted in the psychometric literature that the quality of an HRQoL measurement can be judged by its reliability, stability, prominence of ceiling/floor effects and validity.29 Stability was not tested here because of the cross-sectional nature of this study, but by the other criteria mentioned, the SF-36 and the PGWB performed equally well and both performed slightly better than the NHP. The PGWB had equivalent internal consistency to the SF-36, and had the least prominent ceiling and floor effects of the HRQoL instruments used.
Strong correlations were found between the PGWB and the SF-36 in the mental health, general health and vitality domains as well as between the PGWB total scores and SF-36 mental component summary. A study on patients with asthma also found a high correlation between the SF-36 mental component summary and the PGWB total score, and concluded that administering the PGWB together with the SF-36 would be redundant.30 Another study in patients with amyotrophic lateral sclerosis that focused solely on the mental health subscales in these two instruments found that internal consistency was equivalent and that all the PGWB subscales correlated strongly with the SF-36 mental health subscale.31 In the present study, a more nuanced approach was taken by comparing several similar subscales and not singling out mental health. These results support earlier recommendations to choose one or the other, particularly when the goal is to assess mental health, general health or vitality in a population sample in which the majority of the subjects do not have a chronic disease.
The PGWB had the same ability to discriminate the presence of self-rated ill-health as the SF-36 and the NHP. However, the PGWB should perhaps not stand alone if the aim is to assess HRQoL in a population since it does not meet the customary criteria for an HRQoL instrument.32 It does, on the other hand, contain aspects of positive well-being that the others may miss.16 These results are important, first because this is the only study, to the best of our knowledge, that compares the PGWB to the SF-36 applied in a general population sample, and second, because there is a lack of validity studies for the PGWB.14
In the present study, the SF-36 had a higher internal consistency, less prominent floor/ceiling effects and less skewed results than the NHP. These results support earlier findings that the SF-36 performs better than the NHP in population samples.8 9 The congruity between the two instruments was weakest in the social functioning domain. The SF-36 social functioning subscale was more strongly associated with the NHP subscales for emotional reactions and energy than with the NHP social isolation subscale, much like previous findings.26 27 The items in the social functioning domain differ considerably in their content, which may explain this result.33 The SF-36 includes two questions on how/if physical and/or mental problems affect social interactions. The NHP includes five items on loneliness, social interactions, close friends and a feeling of being a burden to others, but without the specific connection to physical or mental symptoms. Notably, for the four remaining common domains, covering both mental and physical aspects of HRQoL, each pair of NHP and SF-36 scales was strongly correlated. Both instruments had the same ability to discriminate the presence of self-rated ill-health.
The only previous population-based comparison, to the best of our knowledge, between the SF-36 and the NHP was performed by Faria et al in community-dwelling subjects in Brazil with a mean age of 70 years.10 The results were mainly similar regarding internal consistency and convergent validity. Faria et al concluded that the SF-36 may be slightly favourable for use in a group of community-dwelling elders because of the prominent ceiling effects seen in NHP. Like Prieto et al, who studied patients with lung disease, we only found small differences between the instruments. It is questionable whether the small differences are clinically relevant even when the instruments are applied in a population sample.27
In this study, the self-rated health scale correlated, not only with similar subscales in the general health domain, but with all the other instruments’ subscales. The correlations between the self-rated health scale and the NHP were the least pronounced of all the comparisons made, with moderate correlations for the subscales measuring the domains of sleep, pain, physical mobility and social isolation. As expected, this population sample did not report a high level of problems in these domains using the NHP. This makes it reasonable to conclude that these domains, when measured with the NHP, were not a major cause of distress for the subjects, and did not strongly affect how they rated their health with the self-rated health scale.
The self-rated health scale could be considered a measure of overall HRQoL even in general population samples when the need for quick and easy administration is pertinent.14 34 35 However, a single-item self-rated health measurement cannot be seen as a substitute for multi-item questionnaires when more detailed information about specific domains, such as mental functioning, sleep and pain, is required.
Strengths and limitations
Very few comparative studies of HRQoL measurements have been published on population samples. This study reports the results from more than 400 subjects with a largely complete data set collected in 2008. However, the inclusion of middle-aged, mainly retired, predominantly female subjects may have led to selection bias and also affects the generalizability of the results, even if the follow-up rates were high. The conclusions about the discriminant validity of the instruments must also be drawn with care since the definition of ill-health was self-rated using the self-rated health scale. Another limitation is the cross-sectional design, which makes it impossible to report on the responsiveness of the instruments. Content validity, structural validity and measurement error were not evaluated either, although all are important criteria when evaluating HRQoL instruments.36 The order in which the instruments were administered could have resulted in a ‘context effect-bias’. However, all subjects completed the questionnaires in the same order, minimising the risk of systematic error.20