Several problems in statistics involve the combination of high-variance unbiased estimators with low-variance estimators that are only unbiased under strong assumptions. A notable example is the estimation of causal effects while combining small experimental datasets with larger observational datasets. There exists a series of recent proposals on how to perform such a combination, even when the bias of the low-variance estimator is unknown. To build intuition for the differing trade-offs of competing approaches, we argue for examining the finite-sample estimation error of each approach as a function of the unknown bias. This includes understanding the bias threshold -- the largest bias for which a given approach improves over using the unbiased estimator alone. Through this lens, we review several recent proposals, and observe in simulation that different approaches exhibit qualitatively different behavior. We also introduce a simple alternative approach, which compares favorably in simulation to recent alternatives, having a higher bias threshold and generally making a more conservative trade-off between best-case performance (when the bias is zero) and worst-case performance (when the bias is adversarially chosen). More broadly, we prove that for any amount of (unknown) bias, the MSE of this estimator can be bounded in a transparent way that depends on the variance / covariance of the underlying estimators that are being combined.
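To make the notion of a bias threshold concrete, the following minimal sketch considers the simplest possible case: a fixed-weight convex combination of an unbiased and a biased estimator. This is an illustrative assumption, not the estimator proposed here or any of the reviewed proposals, which choose the combination in a data-dependent way. The function names (`combined_mse`, `bias_threshold`) and the numerical values are hypothetical. For a fixed weight, the MSE is the variance of the combination plus the squared weighted bias, and the bias threshold is the largest absolute bias at which that MSE does not exceed the variance of the unbiased estimator alone.

```python
import math


def combined_mse(weight, bias, var_unbiased, var_biased, cov=0.0):
    """MSE of (1 - weight) * unbiased + weight * biased, for a FIXED weight.

    Assumption for illustration: the weight is a constant, not estimated
    from data, so the MSE decomposes exactly into variance + bias^2.
    """
    variance = (
        (1 - weight) ** 2 * var_unbiased
        + weight ** 2 * var_biased
        + 2 * weight * (1 - weight) * cov
    )
    return variance + (weight * bias) ** 2


def bias_threshold(weight, var_unbiased, var_biased, cov=0.0):
    """Largest |bias| for which the fixed-weight combination has MSE no
    larger than the unbiased estimator alone (whose MSE is var_unbiased).

    Obtained by solving combined_mse(weight, b, ...) = var_unbiased for b.
    Returns 0.0 if the combination never improves, even with zero bias.
    """
    if not 0 < weight <= 1:
        raise ValueError("weight must lie in (0, 1]")
    squared = (
        var_unbiased * (2 - weight) / weight
        - var_biased
        - 2 * (1 - weight) * cov / weight
    )
    return math.sqrt(max(0.0, squared))


if __name__ == "__main__":
    # Hypothetical values: a noisy experimental estimate (variance 1.0)
    # and a precise observational estimate (variance 0.1), independent.
    var_u, var_b, w = 1.0, 0.1, 0.5
    threshold = bias_threshold(w, var_u, var_b)
    print(f"bias threshold at weight {w}: {threshold:.4f}")
    for bias in (0.0, 0.5, 1.0, threshold, 2.0):
        mse = combined_mse(w, bias, var_u, var_b)
        print(f"bias={bias:.4f}  MSE={mse:.4f}  improves={mse < var_u}")
```

Plotting `combined_mse` against `bias` for several weights gives a rough picture of the trade-off described above: larger weights on the biased estimator lower the MSE when the bias is zero but also lower the bias threshold and raise the worst-case error.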