On the validity of area-based income measures to proxy household income
© Hanley and Morgan; licensee BioMed Central Ltd. 2008
Received: 04 December 2007
Accepted: 10 April 2008
Published: 10 April 2008
This paper assesses the agreement between household-level income data and an area-based income measure, and whether or not discrepancies create meaningful differences when applied in regression equations estimating total household prescription drug expenditures.
Using administrative data files for the population of BC, Canada, we calculate income deciles from both area-based census data and Canada Revenue Agency validated household-level data. These deciles are then compared for misclassification. Spearman's correlation, kappa coefficients and weighted kappa coefficients are all calculated. We then assess the validity of using the area-based income measure as a proxy for household income in regression equations explaining socio-economic inequalities in total prescription drug expenditures.
The variability between household-level income and area-based income is large. Only 37% of households are classified by area-based measures to be within one decile of the classification based on household-level incomes. Statistical evidence of the disagreement between income measures also indicates substantial misclassification, with Spearman's correlations, kappa coefficients and weighted kappa coefficients all indicating little agreement. The regression results show that the size of the coefficients changes considerably when area-based measures are used instead of household-level measures, and that use of area-based measures smooths out important variation across the income distribution.
These results suggest that, in some contexts, the choice of area-based versus household-level income can drive conclusions in an important way. Access to reliable household-level income/socio-economic data such as the tax-validated data used in this study would unambiguously improve health research and therefore the evidence on which health and social policy would ideally rest.
Measures of income are often central to health and health policy research. Among many potential implications, income can be a non-medical determinant of health [1–3], an enabling factor for access to care, or a consideration when judging equity of policies and programs. As important as this variable may be, it is often difficult for health and health policy researchers to obtain reliable, individual-level income information for the populations they study. In the absence of individual-level income data, investigators often supplement health research datasets with group-based measures such as area-based average income constructed from national census data. Such measures are used as a proxy for individual-level income data on the assumption that household incomes will be reasonably homogeneous within small enough residential areas. If, however, there is significant heterogeneity in the areas used, then the aggregate measures can result in ecological fallacy, in which an association observed between variables at an aggregate level does not represent the association that exists at an individual level.
Prior studies have investigated misclassification of income and other socio-economic variables by comparing individual versus area-level survey responses for small samples of the population [8, 9] and by comparing survey-based measures for different-sized census areas [10, 11]. Using a unique dataset that contains validated household income data for approximately 78% of the population of British Columbia (BC), we investigate the level of misclassification that can occur when census-defined, area-based income is used as a proxy for an individual's actual household-level income. As the question of most interest to researchers concerns how well aggregate variables perform when they are entered in health outcomes equations, we then assess the sensitivity of the analysis of health-related inequities in total prescription drug costs to whether income is measured as an area-based variable or a household-level variable.
Our primary datasets are administrative files for the provincially administered, universal public medical and hospital health insurance program, the Medical Services Plan (MSP) of BC. This program covers virtually all 4.2 million residents of BC, excluding only those residents covered by federal health insurance programs (collectively about 4% of the population). We restrict our attention to households in which at least one member resided in BC for at least 275 days per year from 2001 to 2004, inclusive.
Household income was obtained from the 2004 registration files for the provincially administered, universal public pharmaceutical insurance program, BC PharmaCare. In addition to programs for social assistance recipients and other select populations, BC PharmaCare began offering income-based public drug coverage to all residents of the province in May 2003. Terms such as deductibles and co-insurance are based on household income, with more generous but still income-based coverage offered to senior citizens (residents aged 65 and older). For all households that registered to receive coverage, the BC Ministry of Health obtains net, pre-tax income information from the Canada Revenue Agency. Because of differences in coverage offered and average needs, 95% of households with one or more senior members were registered for Fair PharmaCare in 2004, whereas only 73% of non-senior households were registered.
The area-based income variables used in this study are based on linking MSP registry postal codes to average household income in the area as recorded in the 2001 Census. Statistics Canada collates average household income and composition for over 7,000 Census Dissemination Areas, each comprising 400 to 700 persons. For research purposes, these areas are sorted by income and aggregated into 1,000 strata. Income strata contain an average of 1,700 households, with some variation due to differences in population by postal code. Both the household-level and area-based income variables are based on the same income concept, gross income prior to any deductions.
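The linkage and stratification described above can be sketched as follows. This is a minimal illustration, not the procedure used to build the study data: the function names, area identifiers, postal codes and incomes are all hypothetical, and the sketch assumes strata hold near-equal numbers of areas, which the text does not specify.

```python
def assign_strata(area_incomes, n_strata):
    """Sort areas by average income and split them into n_strata groups.

    area_incomes maps a (hypothetical) area id to average household income.
    Returns a dict mapping area id to a stratum from 1 (lowest) to n_strata.
    Assumption: strata contain near-equal numbers of areas.
    """
    ranked = sorted(area_incomes, key=lambda area: area_incomes[area])
    n = len(ranked)
    return {area: (rank * n_strata) // n + 1 for rank, area in enumerate(ranked)}


def link_households(postal_to_area, household_postal, strata):
    """Attach an area-based income stratum to each household via its postal code."""
    linked = {}
    for household, postal in household_postal.items():
        area = postal_to_area.get(postal)
        # None marks a household whose postal code cannot be linked to an area.
        linked[household] = strata.get(area)
    return linked


# Made-up example: four areas grouped into two strata.
areas = {"DA1": 41000, "DA2": 88000, "DA3": 52000, "DA4": 67000}
strata = assign_strata(areas, n_strata=2)
households = link_households(
    {"V6T": "DA1", "V5K": "DA2", "V8W": "DA4"},
    {"h1": "V6T", "h2": "V5K", "h3": "V8W", "h4": "X0X"},
    strata,
)
print(strata)
print(households)
```

Every household in an area inherits the same stratum regardless of its own income, which is the source of the misclassification examined in this paper.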
Total individual expenditures on prescription drugs were obtained from BC PharmaNet. BC PharmaNet is an administrative dataset in which every prescription dispensed in the province must be entered by law; it is designed to support drug dispensing, drug monitoring and claims processing. These individual expenditures were aggregated at the household level according to registration files for the MSP program to create a variable indicating total household spending on prescription drugs.
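The household-level aggregation can be illustrated with a short sketch. The record layout, identifiers and dollar amounts below are invented for illustration and do not reflect the actual structure of the PharmaNet or MSP files.

```python
def household_drug_spending(claims, household_of):
    """Sum individual prescription costs to the household level.

    claims: iterable of (person_id, cost) records, one per dispensed prescription.
    household_of: dict mapping person_id to household_id (a stand-in for the
    registration file). Households with no claims keep a total of zero.
    """
    totals = {household: 0.0 for household in household_of.values()}
    for person, cost in claims:
        totals[household_of[person]] += cost
    return totals


# Made-up records: two people in household hA, one each in hB and hC.
claims = [("p1", 25.0), ("p2", 40.5), ("p1", 10.0), ("p3", 5.0)]
registry = {"p1": "hA", "p2": "hA", "p3": "hB", "p4": "hC"}
totals = household_drug_spending(claims, registry)
print(totals)
```

Initialising every registered household at zero keeps households with no dispensed prescriptions in the data, so they still count toward decile means.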
The research data were extracted for this study from the British Columbia Linked Health Database and the BC PharmaNet database with permission of the BC Ministry of Health and the College of Pharmacists of BC. Ethics approval was obtained from the Behavioural Research Ethics Board at the University of British Columbia.
The household-specific and area-based income measures were each aggregated into deciles (ordered from lowest to highest income). We assess the discrepancy between the two measures using the Canada Revenue Agency (CRA) validated, household-specific incomes as the standard. We calculated the Spearman's rank correlations of the various income measures, and both the kappa and weighted kappa to measure the degree of non-random agreement and partial agreement between the measures.
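For readers who want to reproduce this kind of comparison, the agreement statistics can be sketched in plain Python. The weighting scheme for the weighted kappa is not stated here, so the sketch assumes linear weights; the function names and the toy data (three categories instead of ten deciles) are illustrative only.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(a, b):
    """Spearman's rank correlation: Pearson correlation of the tie-adjusted ranks."""
    ra, rb = average_ranks(a), average_ranks(b)
    ma, mb = sum(ra) / len(ra), sum(rb) / len(rb)
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    var_a = sum((x - ma) ** 2 for x in ra)
    var_b = sum((y - mb) ** 2 for y in rb)
    return cov / (var_a * var_b) ** 0.5


def agreement_stats(a, b, k=10):
    """Kappa, linearly weighted kappa and share within one category.

    a, b: lists of category labels coded 1..k (deciles when k=10).
    """
    n = len(a)
    table = [[0] * k for _ in range(k)]
    for x, y in zip(a, b):
        table[x - 1][y - 1] += 1
    row = [sum(r) / n for r in table]
    col = [sum(table[i][j] for i in range(k)) / n for j in range(k)]

    def kappa(weight):
        observed = sum(weight(i, j) * table[i][j] / n for i in range(k) for j in range(k))
        expected = sum(weight(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
        return (observed - expected) / (1 - expected)

    def exact(i, j):
        return 1.0 if i == j else 0.0

    def linear(i, j):  # assumption: linear disagreement weights
        return 1.0 - abs(i - j) / (k - 1)

    within_one = sum(1 for x, y in zip(a, b) if abs(x - y) <= 1) / n
    return {"kappa": kappa(exact), "weighted_kappa": kappa(linear), "within_one": within_one}


# Toy data with three income groups rather than ten deciles.
household = [1, 1, 2, 2, 3, 3]
area = [1, 2, 2, 3, 3, 1]
stats = agreement_stats(household, area, k=3)
rho = spearman(household, area)
print(stats, rho)
```

Unweighted kappa credits only exact matches, while the weighted version gives partial credit to near misses, which is why both are reported alongside the share of households classified within one decile.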
We proceed to examine whether the choice of income measure has an impact on how pharmaceutical expenditures are distributed by income status. We begin by examining the distribution of prescription drug expenditures by income decile, where the deciles are defined first according to household-level income and then according to neighbourhood-level income. As measurement error is accommodated more easily in regression analysis than in descriptive analysis, we also include a series of dummy variables for each version of the income variable in an OLS regression in order to determine whether area-based income and household income generate meaningfully different results when applied in a research context. We regress total drug expenditures on income with and without covariates controlling for the presence of one or more seniors in the household as well as household size. Through the comparison of coefficients between household-level income variables and area-level income variables, one can reach some conclusions about the appropriateness of substituting an area-based measure for a missing household-level variable in a regression equation. By including regressions with and without covariates, we can determine whether multivariate models influence the discrepancy between area-based and household-level variables.
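A small sketch may help show what the dummy-variable regression without covariates estimates. When the only regressors are an intercept and decile dummies, OLS coefficients reduce to differences in mean spending between each decile and the omitted reference decile, so the sketch computes them from group means. The choice of decile 10 as the reference category, the function name and all numbers are assumptions made for illustration; models with covariates require a full least-squares fit and are not covered.

```python
def decile_dummy_coefficients(deciles, spending, reference=10):
    """Coefficients from regressing spending on decile dummies (no covariates).

    With an intercept and one dummy per non-reference decile, OLS gives:
      intercept      = mean spending in the reference decile
      coefficient(d) = mean spending in decile d minus the reference mean
    """
    groups = {}
    for decile, amount in zip(deciles, spending):
        groups.setdefault(decile, []).append(amount)
    means = {decile: sum(v) / len(v) for decile, v in groups.items()}
    intercept = means[reference]
    coefs = {d: means[d] - intercept for d in sorted(means) if d != reference}
    return intercept, coefs


# Made-up households: same spending, classified two different ways.
spending = [900.0, 700.0, 500.0, 300.0, 200.0, 400.0]
household_decile = [1, 1, 5, 5, 10, 10]
area_decile = [1, 5, 5, 10, 10, 1]

hh_intercept, hh_coefs = decile_dummy_coefficients(household_decile, spending)
area_intercept, area_coefs = decile_dummy_coefficients(area_decile, spending)
print(hh_intercept, hh_coefs)
print(area_intercept, area_coefs)
```

In this toy example the same households and the same spending produce different coefficients once some households are assigned to the wrong decile, which is the kind of comparison the regressions in this paper are designed to make.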
Table: Entire BC population, 2003. Agreement between household-level validated income deciles and area-based income deciles. Dimensions: household-level validated income decile; area-based income decile.
Table: Percentage of discrepancy by decile between area-based and household-level income measures. Column: actual household-level income.
Table: Spearman's correlation, kappa and weighted kappa coefficients for the association between the area-based income measures and the household income measure. Columns: actual household-level income; best-case scenario (including non-registrants).
Table: Total drug costs by income decile. Column headers: entire BC population; mean total drug costs; percent of total drug costs. Panels: deciles measured by CRA validated income; deciles measured by neighbourhood income.
Table: Results for the regression of dummy variables indicating income decile against total drug costs. Columns: household income (without covariates); neighbourhood income (without covariates). Rows: income deciles 1 to 9; presence of seniors.
We found a substantial level of discrepancy between the area-based and household-level income measures. Using validated household income as the standard, area-based measures misclassified the income decile for eighty-five percent or more of the households in the data. We also found that these discrepancies affected the size of coefficients in regression analyses, suggesting that very different conclusions can be reached regarding the 'same' issue depending on which income variable is used. Thus, these results indicate that, at least in some contexts, the choice of neighbourhood versus household income can drive conclusions in an important way. Our results are consistent with a large amount of work indicating substantial discrepancy between area-based and household SES measures [2, 6, 8, 10].
There are also a couple of important caveats. The first is that our study did not examine the inclusion of income as simply one of several control variables, but rather only looked at the difference between household-level and area-level income when applied as the primary variable of interest. Thus, results cannot be extended to the use of income as a control in much larger regression equations. Second, these results are not meant to suggest that the use of neighbourhood income is inferior in all contexts. An author particularly concerned with measuring permanent income free of yearly fluctuations may find that neighbourhood income provides a better measure. When measuring access to health care, it might also be true that low-income families living in high-income neighbourhoods have better access to care than other similar low-income families simply because of where they live. Thus, an argument could be made for including both measures in this type of work.
While the level of agreement between area-based and household-level SES measures has frequently been studied, our work adds to the knowledge base for several reasons. It encompasses a large number of Canadians: a sample of 78% of all households in British Columbia, including 95% of all senior households. Also, while other studies have tended to compare area-based measures to household-level survey data [6, 8, 9] or have compared two or more different-sized area-based measures [10, 11], we have used highly reliable household-level income data validated with the Canada Revenue Agency. Therefore, we have been able to avoid all self-reporting bias, we have a great deal of confidence in our household-level income variable, and we have been able to analyze almost the entire population of a Canadian province.
While many authors have argued that household-level income should be used whenever possible, census-based aggregate measures will continue to be necessary for health research until household-level data become more readily available. Two suggestions can be made based on these research results. The first is that researchers should be cautious when interpreting the results of studies using aggregate measures as proxies for individual and household income. Area-based measures are approximations that are best suited to investigating major differences in incomes (e.g., differences of two or more quintiles) or to studying the context in which someone lives rather than their specific income. The second suggestion is perhaps obvious to researchers but important for governments and statistical agencies to fully understand: access to reliable individual-level income/socio-economic data, as well as the neighbourhood-level income data that are currently available, would unambiguously improve health research and therefore the evidence on which health and social policy would ideally rest.
This research was supported by a CIHR operating grant. Steve Morgan is supported by a New Investigator award from the Canadian Institutes of Health Research (CIHR) and Scholar Award from the MSFHR.
- Adler NE, Boyce WT, Chesney MA, Folkman S, Syme SL: Socioeconomic inequalities in health. No easy solution. JAMA. 1993, 269: 3140-3145. 10.1001/jama.269.24.3140.
- Braveman PA, Cubbin C, Egerter S, Chideya S, Marchi KS, Metzler M, Posner S: Socioeconomic Status in Health Research: One Size Does Not Fit All. JAMA. 2005, 294: 2879-2888. 10.1001/jama.294.22.2879.
- Marmot MG, Rose G, Shipley M, Hamilton PJ: Employment grade and coronary heart disease in British civil servants. Br Med J. 1978, 32 (4): 244-9.
- van Doorslaer E, Masseria C, Koolman X: Inequalities in access to medical care by income in developed countries. Can Med Assoc J. 2006, 174: 177-183. 10.1503/cmaj.050584.
- Culyer AJ: Health, Health Expenditures and Equity. 1991, University of York, Centre for Health Economics.
- Geronimus AT, Bound J, Neidert LJ: On the Validity of Using Census Geocode Characteristics to Proxy Individual Socioeconomic Characteristics. J Am Stat Assoc. 1996, 91: 529-537. 10.2307/2291645.
- Last JM: A Dictionary of Epidemiology. 1995, Oxford: Oxford University Press.
- Demissie K, Hanley JA, Menzies D, Joseph L, Ernst P: Agreement in measuring socio-economic status: area-based versus individual measures. Chronic Dis Can. 2000, 21 (1): 1-7.
- Diez-Roux AV, Kiefe CI, Jacobs DR, Haan M, Jackson SA, Nieto FJ, Paton CC, Schulz R: Area characteristics and individual-level socioeconomic position indicators in three population-based epidemiologic studies. Ann Epidemiol. 2001, 11: 395-405. 10.1016/S1047-2797(01)00221-6.
- Geronimus AT, Bound J: Use of census-based aggregate variables to proxy for socioeconomic group: evidence from national samples. Am J Epidemiol. 1998, 148: 475-486.
- Southern DA, Ghali WA, Faris PD, Norris CM, Galbraith PD, Graham MM, Knudtson ML: Misclassification of income quintiles derived from area-based measures: A comparison of enumeration area and forward sortation area. Can J Public Health. 2002, 93: 465-469.
- The pre-publication history for this paper can be accessed here: http://www.biomedcentral.com/1472-6963/8/79/prepub
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.