Observational data is often readily available in large quantities, but can yield biased causal effect estimates because of unobserved confounding. Recent work attempts to remove this bias by supplementing observational data with experimental data, which, when available, is typically smaller in scale because of the time and cost of running a randomised controlled trial. In this work, we prove a theorem that places fundamental limits on this ``best of both worlds'' approach. Using the framework of impossible inference, we show that although experimental data can be used to \emph{falsify} causal effect estimates from observational data, in general it cannot be used to \emph{validate} such estimates. Our theorem shows that, while experimental data can be used to detect bias in observational studies, it cannot be used to remove this bias without additional assumptions on the smoothness of the correction function. We provide a practical example of such an assumption, developing a novel Gaussian process-based approach to construct intervals that contain the true treatment effect with high probability, both inside and outside the support of the experimental data. We demonstrate our methodology on both simulated and semi-synthetic datasets and make the \href{https://github.com/Jakefawkes/Obs_and_exp_data}{code available}.