The empirical risk minimization approach to data-driven decision making requires access to training data drawn under the same conditions as those that will be faced when the decision rule is deployed. However, in a number of settings, we may be concerned that our training sample is biased in the sense that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population. In such settings, empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called conditional $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily strongly but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data and a case study on ICU length of stay prediction.
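The equivalence between the worst-case risk and an augmented convex problem can be illustrated with a minimal numerical sketch. It rests on assumptions not spelled out in the abstract: that "bounded by a constant factor" means the reweighting from training to test distribution lies in $[\Gamma^{-1}, \Gamma]$ and averages to one, and that the auxiliary variable `a` is a single scalar (which ignores covariates, whereas the conditional model would presumably let it depend on them). The function names `ru_loss` and `worst_case_risk` are hypothetical.

```python
import numpy as np


def ru_loss(losses, a, gamma):
    """Augmented (Rockafellar-Uryasev style) objective, averaged over the sample.

    Assumed form: loss/gamma + (1 - 1/gamma) * a + (gamma - 1/gamma) * max(loss - a, 0).
    It is convex in the auxiliary scalar `a`.
    """
    losses = np.asarray(losses, dtype=float)
    excess = np.maximum(losses - a, 0.0)
    return float(np.mean(losses / gamma
                         + (1.0 - 1.0 / gamma) * a
                         + (gamma - 1.0 / gamma) * excess))


def worst_case_risk(losses, gamma):
    """Direct worst case: maximize mean(w * loss) over weights w with
    1/gamma <= w <= gamma and mean(w) == 1 (assumed bias model)."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # largest first
    n = len(losses)
    w = np.full(n, 1.0 / gamma)        # every point gets the minimum weight
    budget = n * (1.0 - 1.0 / gamma)   # remaining weight to hand out
    step = gamma - 1.0 / gamma         # most any single point can gain
    for i in range(n):                 # greedily upweight the largest losses
        add = min(step, budget)
        w[i] += add
        budget -= add
        if budget <= 0.0:
            break
    return float(np.mean(w * losses))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sample = rng.exponential(size=200)  # synthetic per-example losses
    gamma = 2.0
    # The objective is piecewise linear and convex in `a`, so its minimum
    # is attained at one of the observed loss values.
    best = min(ru_loss(sample, a, gamma) for a in sample)
    print(best, worst_case_risk(sample, gamma))
```

On an empirical sample, minimizing `ru_loss` over `a` should reproduce the greedy worst case, and with `gamma = 1` both reduce to the ordinary empirical risk. In a learning setting, the per-example losses would depend on the decision rule, which would then be optimized jointly with the auxiliary variable.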