We introduce a new method for estimating the mean of an outcome variable within groups when researchers only observe the average of the outcome and group indicators across a set of aggregation units, such as geographical areas. Existing methods for this problem, also known as ecological inference, implicitly make strong assumptions about the aggregation process. We first formalize weaker conditions for identification which hold conditionally on covariates. To efficiently control for many covariates, we propose a debiased machine learning estimator that is based on nuisance functions restricted to a partially linear form. Our estimator admits a semiparametric sensitivity analysis which allows researchers to evaluate the impact of violations of the key identifying assumption. We also propose a nonparametric test for the identifying assumption itself. Finally, we derive asymptotically valid confidence intervals for local, unit-level estimates under additional assumptions. Simulations and validation on real-world data where ground truth is available demonstrate the advantages of our approach over existing methods. Open-source software is available which implements the proposed methods.
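To make the setup concrete, the following is a minimal sketch of the classical ecological regression baseline (often called Goodman regression), not the debiased machine learning estimator proposed here. It assumes two groups, and it assumes that each group's mean outcome is the same in every aggregation unit and unrelated to the group shares, which is the kind of strong aggregation assumption the abstract refers to. The function name `goodman_regression`, the simulated data, and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np


def goodman_regression(x, y):
    """Estimate two group means from aggregate data by least squares.

    x: share of group 1 in each aggregation unit, shape (n,)
    y: average outcome in each aggregation unit, shape (n,)
    Returns (estimated mean for group 1, estimated mean for group 0).
    """
    # Accounting identity when group means are constant across units:
    #   y_i = b1 * x_i + b0 * (1 - x_i) + noise
    design = np.column_stack([x, 1.0 - x])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return float(coef[0]), float(coef[1])


# Simulated aggregate data (hypothetical values, for illustration only).
rng = np.random.default_rng(0)
n_units = 2000
x = rng.uniform(0.05, 0.95, size=n_units)  # group-1 share per unit
true_b1, true_b0 = 0.7, 0.3                # group means, constant by assumption
y = true_b1 * x + true_b0 * (1.0 - x) + rng.normal(0.0, 0.02, size=n_units)

est_b1, est_b0 = goodman_regression(x, y)
print(f"group 1 mean: {est_b1:.3f}, group 0 mean: {est_b0:.3f}")
```

When the constancy assumption fails, for example when group means vary with the group shares, this regression is generally biased. That failure mode is what motivates identification conditions that hold only conditionally on covariates.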