We study the problem of learning with selectively labeled data, which arises when outcomes are only partially labeled as a result of historical decision-making. The distribution of the labeled data may differ substantially from that of the full population, especially when the historical decisions and the target outcome are both affected by unobserved factors. Consequently, a prediction rule learned from the labeled data alone may be severely biased when deployed to the full population. Our paper tackles this challenge by exploiting the fact that, in many applications, the historical decisions were made by a set of heterogeneous decision-makers. In particular, we analyze this setup in a principled instrumental variable (IV) framework. We establish conditions under which the full-population risk of any given prediction rule is point-identified from the observed data, and we provide sharp risk bounds when point identification fails. We further propose a weighted learning approach that learns prediction rules robust to label selection bias in both identification settings. Finally, we apply our proposed approach to a semi-synthetic financial dataset and demonstrate its superior performance in the presence of selection bias.