Measuring average differences in an outcome across racial or ethnic groups is a crucial first step for equity assessments, but researchers often lack the individual-level race and ethnicity data needed to calculate those differences. A common solution is to impute the missing race or ethnicity labels using proxies, then use those imputations to estimate the disparity. Conventional standard errors mischaracterize the resulting estimate's uncertainty because they treat the imputation model as given and fixed, rather than as an unknown object that is itself estimated with error. We propose a dual-bootstrap approach that explicitly accounts for this measurement uncertainty and thus enables more accurate statistical inference, which we demonstrate via simulation. In addition, we adapt our approach to the commonly used Bayesian Improved Surname Geocoding (BISG) imputation algorithm, where direct bootstrapping is infeasible because the underlying Census Bureau data are unavailable. In simulations, we find that measurement uncertainty is generally negligible for BISG except in particular circumstances; bias, not variance, is likely the predominant source of error. We apply our method to quantify the uncertainty of prevalence estimates of common health conditions by race using data from the American Family Cohort.
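To make the idea concrete, the following is a minimal sketch of one plausible reading of a dual bootstrap; the abstract does not spell out the algorithm, so every detail here is an assumption. It assumes a labeled reference sample (`ref_proxy`, `ref_group`) used to fit the imputation model, an analysis sample (`proxy`, `outcome`) with no race labels, a binary group, a single categorical proxy, and a probability-weighted difference in means as the disparity estimator. The function names `fit_imputer`, `weighted_disparity`, and `dual_bootstrap` are hypothetical, and the data are synthetic.

```python
import numpy as np

def fit_imputer(proxy, group, n_levels):
    """Estimate P(group = 1 | proxy level) by smoothed frequencies (assumed model)."""
    counts = np.bincount(proxy, minlength=n_levels)
    ones = np.bincount(proxy, weights=group, minlength=n_levels)
    return (ones + 0.5) / (counts + 1.0)  # add-half smoothing avoids 0/0

def weighted_disparity(prob, outcome):
    """Probability-weighted mean outcome of group 1 minus that of group 0."""
    mean1 = np.sum(prob * outcome) / np.sum(prob)
    mean0 = np.sum((1 - prob) * outcome) / np.sum(1 - prob)
    return mean1 - mean0

def dual_bootstrap(ref_proxy, ref_group, proxy, outcome, n_levels,
                   n_boot=500, seed=0):
    """Return (naive SE, dual-bootstrap SE) of the disparity estimate."""
    rng = np.random.default_rng(seed)
    n_ref, n = len(ref_proxy), len(proxy)
    fixed_model = fit_imputer(ref_proxy, ref_group, n_levels)
    naive = np.empty(n_boot)
    dual = np.empty(n_boot)
    for b in range(n_boot):
        i = rng.integers(0, n, size=n)          # resample the analysis sample
        j = rng.integers(0, n_ref, size=n_ref)  # resample the reference sample
        # Naive: imputation model held fixed across replicates.
        naive[b] = weighted_disparity(fixed_model[proxy[i]], outcome[i])
        # Dual: imputation model refit on the resampled reference data.
        model_b = fit_imputer(ref_proxy[j], ref_group[j], n_levels)
        dual[b] = weighted_disparity(model_b[proxy[i]], outcome[i])
    return naive.std(ddof=1), dual.std(ddof=1)

# Synthetic illustration (all numbers are made up for the sketch).
rng = np.random.default_rng(1)
n_levels = 5
true_p = np.array([0.1, 0.3, 0.5, 0.7, 0.9])    # P(group = 1 | proxy level)
ref_proxy = rng.integers(0, n_levels, size=200)
ref_group = rng.binomial(1, true_p[ref_proxy])
proxy = rng.integers(0, n_levels, size=5000)
group = rng.binomial(1, true_p[proxy])          # unobserved in practice
outcome = rng.binomial(1, 0.2 + 0.1 * group)

se_naive, se_dual = dual_bootstrap(ref_proxy, ref_group, proxy, outcome,
                                   n_levels, n_boot=200)
print(f"naive SE: {se_naive:.4f}, dual-bootstrap SE: {se_dual:.4f}")
```

Comparing the two standard errors indicates how much of the uncertainty comes from estimating the imputation model rather than from sampling the analysis data. This sketch does not cover the BISG adaptation, where the reference data cannot be resampled directly, and it says nothing about bias in the weighted estimator.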