Large-scale datasets are increasingly used to inform decision-making. While this effort aims to ground policy in real-world evidence, observational data are often subject to selection bias and other forms of distribution shift. Previous approaches to robust inference give guarantees that depend on a user-specified amount of possible distribution shift (e.g., the maximum KL divergence between the observed and target distributions). However, decision makers often have additional knowledge about the target distribution that constrains the kinds of shift that are possible. To leverage such information, we propose a framework that enables statistical inference in the presence of selection bias that obeys user-specified constraints, expressed as functions whose expectations are known under the target distribution. The output is a set of high-probability bounds on the value of an estimand for the target distribution. Hence, our method uses domain knowledge to partially identify a wide class of estimands. We analyze the computational and statistical properties of methods for estimating these bounds, and show that our method can produce informative bounds on a variety of simulated and semisynthetic tasks, as well as in a real-world use case.