Hierarchical Bayes estimation of small area means using statistical linkage of disparate data sources

We propose a Bayesian approach to estimating finite population means for small areas. The proposed methodology improves on traditional sample survey methods because, unlike those methods, it borrows strength from multiple data sources. Our approach differs fundamentally from existing Bayesian approaches to small area finite population sampling, which typically assume a hierarchical model for all units of the finite population. We assume such a model only for the units of the finite population in which the outcome variable is observed, because for these units the assumed model can be checked using existing statistical tools. Modeling unobserved units of the finite population is challenging because the assumed model cannot be checked in the absence of data on the outcome variable. To make reasonable modeling assumptions, we propose forming several cells for each small area using factors that potentially influence the outcome variable of interest. This strategy is expected to bring some degree of homogeneity within a given cell and also among cells from different small areas that are constructed with the same combination of factor levels. Instead of modeling true probabilities for unobserved individual units, we make two assumptions: that the population means of cells with the same combination of factor levels are identical across small areas, and that the population mean of true probabilities for a cell is identical to the mean of true values for the observed units in that cell. We apply the proposed methodology to a real-life COVID-19 survey, linking information from multiple disparate data sources to estimate vaccine-hesitancy rates (proportions) for the 50 US states and Washington, D.C. (the small areas). We also provide practical methods of model selection that can be applied to a wider class of models in similar settings and across a diverse range of scientific problems.
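To make the cell-based assumption concrete, the following minimal sketch shows how cell means shared across small areas could be combined into one mean per area by weighting each cell by its population share. The abstract does not state this aggregation formula, so the weighting scheme, the function name `area_means_from_cells`, and all numbers below are illustrative assumptions rather than the authors' implementation.

```python
def area_means_from_cells(cell_means, cell_counts):
    """Combine shared cell means into one mean per small area.

    cell_means:  list of length C, one mean per factor-level combination,
                 assumed identical across areas (the abstract's assumption).
    cell_counts: list of A lists of length C, the population count of each
                 cell in each area (hypothetical auxiliary data).
    Returns a list of A area-level means.
    """
    area_means = []
    for counts in cell_counts:
        total = sum(counts)
        # Population-share-weighted average of the shared cell means.
        weighted = sum(n * m for n, m in zip(counts, cell_means))
        area_means.append(weighted / total)
    return area_means


# Made-up example: 3 cells (factor-level combinations), 2 small areas.
cell_means = [0.10, 0.30, 0.50]          # e.g. hesitancy rate per cell
cell_counts = [[500, 300, 200],          # area 1 is mostly low-rate cells
               [100, 100, 800]]          # area 2 is mostly high-rate cells
rates = area_means_from_cells(cell_means, cell_counts)
```

In a hierarchical Bayes analysis, the same aggregation would presumably be applied to each posterior draw of the cell means, giving a posterior distribution for every area mean instead of a single point estimate.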
