We consider the problem of learning from data corrupted by underrepresentation bias, where positive examples are filtered from the data at different, unknown rates for a fixed number of sensitive groups. We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out parameters, even in settings where intersectional group membership makes learning each intersectional rate computationally infeasible. Using this estimate for the group-wise drop-out rate, we construct a re-weighting scheme that allows us to approximate the loss of any hypothesis on the true distribution, even if we only observe the empirical error on a biased sample. Finally, we present an algorithm encapsulating this learning and re-weighting process, and we provide strong PAC-style guarantees that, with high probability, our estimate of the risk of the hypothesis over the true distribution will be arbitrarily close to the true risk.
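The estimation and re-weighting steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm, and it rests on several assumptions made only for illustration: groups are disjoint (no intersectional membership), each positive example in group g is kept with an unknown retention probability beta_g (one minus the drop-out rate) while negatives are never dropped, and the function names `positive_odds`, `estimate_retention`, and `reweighted_risk` are hypothetical. Under this model the positive-to-negative odds in the biased data equal beta_g times the true odds, so the ratio of the two odds estimates beta_g, and weighting each retained positive by 1/beta_g recovers the risk on the true distribution.

```python
from collections import defaultdict


def positive_odds(samples):
    """Per-group ratio of positive to negative counts; samples are (group, label) pairs."""
    pos, neg = defaultdict(int), defaultdict(int)
    for group, label in samples:
        if label == 1:
            pos[group] += 1
        else:
            neg[group] += 1
    # Groups with no negatives are skipped because their odds are undefined.
    return {g: pos[g] / neg[g] for g in neg if neg[g] > 0}


def estimate_retention(biased, unbiased):
    """Estimate beta_g as (biased odds) / (true odds) for each group (assumed model)."""
    biased_odds = positive_odds(biased)
    true_odds = positive_odds(unbiased)
    return {
        g: biased_odds[g] / true_odds[g]
        for g in biased_odds
        if true_odds.get(g, 0) > 0
    }


def reweighted_risk(biased, predictions, beta):
    """Self-normalized 0-1 risk: positives in group g get weight 1 / beta_g."""
    weighted_errors, weight_sum = 0.0, 0.0
    for (group, label), pred in zip(biased, predictions):
        w = 1.0 / beta[group] if label == 1 else 1.0
        weighted_errors += w * (pred != label)
        weight_sum += w
    return weighted_errors / weight_sum


# Toy data (made up for illustration): group "a" keeps half of its positives,
# group "b" keeps one third of them.
unbiased = [("a", 1)] * 4 + [("a", 0)] * 4 + [("b", 1)] * 6 + [("b", 0)] * 2
biased = [("a", 1)] * 2 + [("a", 0)] * 4 + [("b", 1)] * 2 + [("b", 0)] * 2

beta = estimate_retention(biased, unbiased)
always_negative = [0] * len(biased)  # hypothesis that predicts 0 everywhere
naive = sum(p != y for (_, y), p in zip(biased, always_negative)) / len(biased)
corrected = reweighted_risk(biased, always_negative, beta)
print(beta, naive, corrected)
```

In this toy example the always-negative hypothesis has risk 10/16 on the unbiased data. The naive error on the biased sample understates it (4/10), while the re-weighted estimate matches the unbiased value up to floating-point error. With finite random samples the estimates of beta_g would be noisy, which is where guarantees of the kind the abstract mentions become relevant.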