Doubly-robust and heteroscedasticity-aware sample trimming for causal inference

A popular method for variance reduction in observational causal inference is propensity-based trimming, the practice of removing units with extreme propensities from the sample. This practice has theoretical grounding when the data are homoscedastic and the propensity model is parametric (Yang and Ding, 2018; Crump et al., 2009), but in modern settings where heteroscedastic data are analyzed with non-parametric models, existing theory fails to support current practice. In this work, we address this challenge by developing new methods and theory for sample trimming. Our contributions are threefold. First, we describe novel procedures for selecting which units to trim. Our procedures differ from previous work in that we trim not only units with small propensities, but also units with extreme conditional variances. Second, we give new theoretical guarantees for inference after trimming. In particular, we show how to perform inference on the trimmed subpopulation without requiring that our regressions converge at parametric rates. Instead, we make only fourth-root rate assumptions like those in the double machine learning literature. This result also applies to conventional propensity-based trimming and thus may be of independent interest. Finally, we propose a bootstrap-based method for constructing simultaneously valid confidence intervals for multiple trimmed subpopulations, which are valuable for navigating the trade-off between sample size and variance reduction inherent in trimming. We validate our methods in simulation, on the 2007-2008 National Health and Nutrition Examination Survey, and on a semi-synthetic Medicare dataset, and find promising results in all settings.
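
To make the contrast between the two kinds of trimming concrete, the following is a minimal sketch and not the procedure proposed in this work. It compares conventional propensity-based trimming (keep units with propensity in [alpha, 1 - alpha]) with one plausible variance-aware rule that drops units whose score var1/e + var0/(1 - e) exceeds a cutoff. The function names, the score, the thresholds alpha and gamma, and the toy arrays are all assumptions made for illustration. In practice the propensities and conditional variances would be estimated, not known.

```python
import numpy as np

def propensity_trim_mask(e, alpha):
    """Conventional trimming: keep units with alpha <= e <= 1 - alpha."""
    e = np.asarray(e, dtype=float)
    return (e >= alpha) & (e <= 1.0 - alpha)

def variance_score(e, var1, var0):
    """Illustrative per-unit score var1/e + var0/(1 - e) (an assumption).

    The score is large when the propensity is extreme OR when the
    conditional outcome variances are large.
    """
    e = np.asarray(e, dtype=float)
    var1 = np.asarray(var1, dtype=float)
    var0 = np.asarray(var0, dtype=float)
    return var1 / e + var0 / (1.0 - e)

def variance_aware_trim_mask(e, var1, var0, gamma):
    """Hypothetical variance-aware trimming: keep units with score <= gamma."""
    return variance_score(e, var1, var0) <= gamma

# Toy inputs (made up): the middle unit has a moderate propensity
# but a large conditional variance in both arms.
e = np.array([0.02, 0.30, 0.50, 0.70, 0.98])
var1 = np.array([1.0, 1.0, 9.0, 1.0, 1.0])
var0 = np.array([1.0, 1.0, 9.0, 1.0, 1.0])

keep_prop = propensity_trim_mask(e, alpha=0.1)
keep_var = variance_aware_trim_mask(e, var1, var0, gamma=10.0)

print("propensity-based keep:", keep_prop)
print("variance-aware keep:  ", keep_var)
```

In this toy example both rules drop the two units with extreme propensities, but only the variance-aware rule also drops the middle unit, whose propensity is 0.5 and whose conditional variances are large.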
