We consider the problem of learning from data corrupted by underrepresentation bias, where positive examples are filtered from the data at different, unknown rates for a fixed number of sensitive groups. We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out parameters, even in settings where intersectional group membership makes learning each intersectional rate computationally infeasible. Using these estimates of the group-wise drop-out rates, we construct a re-weighting scheme that allows us to approximate the loss of any hypothesis on the true distribution, even if we only observe its empirical error on a biased sample. Finally, we present an algorithm encapsulating this learning and re-weighting process, and we provide strong PAC-style guarantees that, with high probability, our estimate of the risk of a hypothesis over the true distribution will be arbitrarily close to its true risk.
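To make the estimate-then-re-weight idea concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes a simplified setting that the abstract does not spell out: groups are disjoint, negative examples are never filtered, and each positive example in group g is kept independently with probability beta_g. Under those assumptions the odds of a positive label in the biased data equal beta_g times the odds in the unbiased data, so the ratio of the two odds estimates beta_g, and weighting each retained positive by 1/beta_g undoes the filtering. The function names (`positive_odds`, `estimate_retention`, `reweighted_risk`) are hypothetical, and the intersectional case is not handled.

```python
from collections import defaultdict


def positive_odds(labels, groups):
    """Per-group odds of a positive label: (#positives) / (#negatives)."""
    pos, neg = defaultdict(int), defaultdict(int)
    for y, g in zip(labels, groups):
        if y == 1:
            pos[g] += 1
        else:
            neg[g] += 1
    # Groups with no negatives are skipped because their odds are undefined.
    return {g: pos[g] / n for g, n in neg.items() if n > 0}


def estimate_retention(biased_y, biased_g, clean_y, clean_g):
    """Estimate beta_g as (biased odds) / (unbiased odds) for each group.

    Assumption: negatives are never dropped, so filtering positives at
    rate beta_g scales the positive odds in group g by exactly beta_g.
    """
    biased = positive_odds(biased_y, biased_g)
    clean = positive_odds(clean_y, clean_g)
    return {
        g: biased[g] / clean[g]
        for g in biased
        if g in clean and clean[g] > 0
    }


def reweighted_risk(losses, labels, groups, beta):
    """Self-normalized importance-weighted average of per-example losses.

    Retained positives in group g get weight 1 / beta_g; negatives get 1.
    """
    weights = [
        1.0 / beta[g] if y == 1 else 1.0
        for y, g in zip(labels, groups)
    ]
    total = sum(w * loss for w, loss in zip(weights, losses))
    return total / sum(weights)
```

With finite samples the ratio of odds is only an estimate of beta_g, so the re-weighted risk inherits that estimation error; quantifying that error is the role of the PAC-style guarantees described above.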