Existing machine learning models have been shown to perform poorly for minority groups, mainly due to biases in data. In particular, datasets, especially social data, often under-represent minorities. In this paper, we consider the problem of identifying representation bias in image datasets without explicit attribute values. Using the notion of data coverage to detect a lack of representation, we develop multiple crowdsourcing approaches. Our core approach, at a high level, is a divide-and-conquer algorithm that applies a search space pruning strategy to efficiently identify whether a dataset lacks adequate coverage for a given group. We provide theoretical analyses of our algorithm, including a tight upper bound on its performance that guarantees its near-optimality. Using this algorithm as the core, we propose multiple heuristics to reduce the coverage detection cost across different cases with multiple intersectional or non-intersectional groups. We demonstrate that pre-trained predictors are not reliable and hence not sufficient for detecting representation bias in the data. Finally, we adjust our core algorithm to utilize existing models for predicting image group(s) to minimize the coverage identification cost. We conduct extensive experiments, including live experiments on Amazon Mechanical Turk, to validate our problem and evaluate our algorithms' performance.
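The divide-and-conquer idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the algorithm's details, so everything below is an assumption made for illustration: that a group counts as covered once at least `threshold` of its members are found, that a crowd worker can answer a yes/no question about a whole subset of images ("does this subset contain at least one member of the group?"), and that subsets are split in half. The names `is_covered`, `contains_group`, and `threshold` are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Sequence, Tuple


def is_covered(
    items: Sequence[int],
    contains_group: Callable[[List[int]], bool],
    threshold: int,
) -> Tuple[bool, int]:
    """Illustrative coverage check (not the paper's exact algorithm).

    items          -- identifiers of the images in the dataset
    contains_group -- hypothetical stand-in for one crowd task: returns True
                      if the given subset holds at least one group member
    threshold      -- assumed number of members needed for "coverage"

    Returns (covered, number_of_queries_issued).
    """
    found = 0
    queries = 0
    stack = [list(items)]
    # Stop as soon as enough members are found or nothing is left to inspect.
    while stack and found < threshold:
        subset = stack.pop()
        if not subset:
            continue
        queries += 1
        if not contains_group(subset):
            continue  # prune: no member anywhere in this subset
        if len(subset) == 1:
            found += 1  # a single image confirmed as a group member
            continue
        mid = len(subset) // 2
        stack.append(subset[mid:])
        stack.append(subset[:mid])
    return found >= threshold, queries


# Simulated oracle standing in for crowd workers (assumption for the demo).
labels = [False] * 16
labels[3] = labels[12] = True
oracle = lambda subset: any(labels[i] for i in subset)

covered, cost = is_covered(range(16), oracle, threshold=2)
print(covered, cost)
```

In this sketch a single "no" answer discards an entire subset, which is where the saving over labeling every image individually would come from when the group is rare. The early stop once `threshold` members are found keeps the cost low when the group is well represented.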