We study the problem of learning with selectively labeled data, which arises when historical decisions determine which outcomes are observed, so that only part of the population is labeled. The labeled data distribution may differ substantially from that of the full population, especially when unobserved factors affect both the historical decisions and the target outcome. Consequently, a prediction rule learned from the labeled data alone may be severely biased when deployed to the full population. Our paper tackles this challenge by exploiting the fact that, in many applications, the historical decisions were made by a set of heterogeneous decision-makers. In particular, we analyze this setup in a principled instrumental variable (IV) framework. We establish conditions under which the full-population risk of any given prediction rule is point-identified from the observed data, and we provide sharp risk bounds when point identification fails. We further propose a weighted learning approach that learns prediction rules robust to the label selection bias in both identification settings. Finally, we apply our proposed approach to a semi-synthetic financial dataset and demonstrate its superior performance in the presence of selection bias.
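As a rough illustration of the selection problem described above, the following sketch simulates an unobserved factor that drives both the historical decision and the outcome, and compares the outcome rate among labeled units with the rate in the full population. It is not the paper's method or data: the data-generating process, the function name `simulate`, and all parameters are assumptions made for illustration only, and the sketch does not implement the IV analysis or the weighted learning approach.

```python
import random


def simulate(n=20000, seed=0):
    """Toy selective-labels simulation (hypothetical data-generating process)."""
    rng = random.Random(seed)
    full_outcomes = []      # outcomes for everyone (unobservable in practice)
    labeled_outcomes = []   # outcomes seen only when the decision was positive
    for _ in range(n):
        u = rng.gauss(0.0, 1.0)  # unobserved factor, assumed standard normal
        # Historical decision: more likely positive when u is high.
        decision = 1 if u + rng.gauss(0.0, 1.0) > 0 else 0
        # Outcome: also more likely positive when u is high.
        outcome = 1 if u + rng.gauss(0.0, 1.0) > 0 else 0
        full_outcomes.append(outcome)
        if decision == 1:
            labeled_outcomes.append(outcome)
    full_rate = sum(full_outcomes) / len(full_outcomes)
    labeled_rate = sum(labeled_outcomes) / len(labeled_outcomes)
    return full_rate, labeled_rate


full_rate, labeled_rate = simulate()
print(f"full-population outcome rate: {full_rate:.3f}")
print(f"labeled-only outcome rate:    {labeled_rate:.3f}")
```

Under these assumptions the labeled-only rate is noticeably higher than the full-population rate, because the same unobserved factor raises both the chance of being labeled and the chance of a positive outcome. An estimate computed from the labeled units alone would therefore not transfer to the full population.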