Many common estimators in machine learning and causal inference are linear smoothers, where the prediction is a weighted average of the training outcomes. Some estimators, such as ordinary least squares and kernel ridge regression, allow for arbitrarily negative weights, which reduce feature imbalance but often at the cost of increased dependence on parametric modeling assumptions and higher variance. By contrast, estimators like importance weighting and random forests (sometimes implicitly) restrict weights to be non-negative, reducing both dependence on parametric modeling and variance at the cost of worse imbalance. In this paper, we propose a unified framework that directly penalizes the level of extrapolation, replacing the current practice of a hard non-negativity constraint with a soft constraint and corresponding hyperparameter. We derive a worst-case extrapolation error bound and introduce a novel "bias-bias-variance" tradeoff, encompassing the bias due to feature imbalance, the bias due to model misspecification, and the estimator's variance; this tradeoff is especially pronounced in high dimensions, particularly when positivity is poor. We then develop an optimization procedure that regularizes this bound while minimizing imbalance, and outline how to use this approach as a sensitivity analysis for dependence on parametric modeling assumptions. We demonstrate the effectiveness of our approach through synthetic experiments and a real-world application involving the generalization of randomized controlled trial estimates to a target population of interest.
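To make the linear-smoother distinction concrete, the following minimal sketch (with synthetic data, not from the paper) contrasts OLS prediction weights, which can be negative for a test point outside the bulk of the training data, with Nadaraya-Watson kernel weights, which are non-negative and sum to one:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training data: n points in d dimensions with outcomes y.
n, d = 20, 3
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# A test point somewhat outside the bulk of the training data.
x0 = np.full(d, 2.0)

# OLS is a linear smoother: yhat(x0) = sum_i w_i * y_i with
# w = X (X^T X)^{-1} x0; nothing constrains these weights to be non-negative.
w_ols = X @ np.linalg.solve(X.T @ X, x0)
beta = np.linalg.lstsq(X, y, rcond=None)[0]
assert np.isclose(w_ols @ y, x0 @ beta)  # weighted average equals the OLS fit

# Negative weights (often present for out-of-bulk test points) signal
# extrapolation beyond the training outcomes.
print("min OLS weight:", w_ols.min())

# By contrast, a Nadaraya-Watson kernel smoother uses non-negative weights
# that sum to one, so the prediction is a convex combination of the y_i.
k = np.exp(-0.5 * np.sum((X - x0) ** 2, axis=1))
w_nw = k / k.sum()
assert (w_nw >= 0).all() and np.isclose(w_nw.sum(), 1.0)
```

The soft-constraint framework described above interpolates between these two regimes by penalizing, rather than forbidding, the negative part of the weights.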