We consider the problem of learning from data corrupted by underrepresentation bias, where positive examples are filtered from the data at different, unknown rates for a fixed number of sensitive groups. We show that with a small amount of unbiased data, we can efficiently estimate the group-wise drop-out rates, even in settings where intersectional group membership makes learning each intersectional rate computationally infeasible. Using these estimated rates, we construct a re-weighting scheme that allows us to approximate the loss of any hypothesis on the true distribution, even if we only observe the empirical error on a biased sample. Finally, we present an algorithm encapsulating this learning and re-weighting process, and we provide strong PAC-style guarantees that, with high probability, our estimate of the risk of the hypothesis over the true distribution will be arbitrarily close to the true risk.
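The estimate-then-re-weight idea can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes disjoint groups (no intersectional membership), that each positive example of group g is kept independently with probability `beta[g]`, that negative examples are never dropped, and that every group has at least one negative in both samples. The function names `estimate_retention` and `reweighted_risk` are hypothetical. Under these assumptions, the positive-to-negative odds in the biased sample are `beta[g]` times the odds in the unbiased sample, so their ratio estimates `beta[g]`.

```python
from collections import Counter


def estimate_retention(biased, unbiased):
    """Estimate beta[g], the assumed probability that a positive of group g is kept.

    Each sample is a (group, label) pair with label in {0, 1}.
    Assumes disjoint groups and at least one negative per group in each sample.
    """
    def odds(samples):
        counts = Counter(samples)
        groups = {g for g, _ in samples}
        # Positive-to-negative odds within each group.
        return {g: counts[(g, 1)] / counts[(g, 0)] for g in groups}

    biased_odds, true_odds = odds(biased), odds(unbiased)
    # Filtering only positives scales the odds by beta[g].
    return {g: biased_odds[g] / true_odds[g] for g in true_odds}


def reweighted_risk(biased, predictions, beta):
    """Self-normalized 0-1 risk estimate on the biased sample.

    Positives of group g get weight 1 / beta[g]; negatives get weight 1.
    """
    total_weight = 0.0
    total_loss = 0.0
    for (g, y), y_hat in zip(biased, predictions):
        w = 1.0 / beta[g] if y == 1 else 1.0
        total_weight += w
        total_loss += w * (y_hat != y)
    return total_loss / total_weight


# Toy data, made up for illustration only.
unbiased = [("a", 1)] * 4 + [("a", 0)] * 4 + [("b", 1)] * 2 + [("b", 0)] * 4
biased = [("a", 1)] * 2 + [("a", 0)] * 4 + [("b", 1)] * 1 + [("b", 0)] * 8

beta = estimate_retention(biased, unbiased)   # {"a": 0.5, "b": 0.25}
always_negative = [0] * len(biased)           # hypothesis that predicts 0 everywhere
naive = sum(y != p for (_, y), p in zip(biased, always_negative)) / len(biased)
corrected = reweighted_risk(biased, always_negative, beta)
print(beta, naive, corrected)
```

On this toy data the naive error on the biased sample is 0.2, while the re-weighted estimate is 0.4, because each retained positive stands in for the positives that were filtered out. Handling intersectional groups, which the abstract highlights, would need a model of how per-group rates combine and is outside this sketch.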