Machine learning algorithms often struggle to eliminate inherent data biases, particularly those arising from unreliable labels, which poses a significant challenge to ensuring fairness. Existing fairness techniques that address label bias typically modify the model or intervene in the training process, but they lack the flexibility needed for large-scale datasets. To address this limitation, we introduce a data selection method designed to mitigate label bias efficiently and flexibly, better suited to practical needs. Our approach utilizes a zero-shot predictor as a proxy model that simulates training on a clean holdout set. This strategy, supported by peer predictions, ensures the fairness of the proxy model and eliminates the need for an additional holdout set, a common requirement of previous methods. Without altering the classifier's architecture, our modality-agnostic method effectively selects appropriate training data, and experimental evaluations across diverse datasets show that it handles label bias and improves fairness both efficiently and effectively.
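The data-selection idea above can be illustrated with a minimal sketch. This is not the paper's exact algorithm (which relies on peer predictions); it only shows the simpler agreement rule that underlies proxy-based selection: keep training examples whose observed labels agree with a zero-shot proxy model, and flag the rest as likely mislabeled. The names `select_by_proxy` and `proxy` are hypothetical, introduced here for illustration.

```python
# Illustrative sketch (assumption: a plain label/proxy agreement rule,
# not the paper's peer-prediction-based method).

def select_by_proxy(examples, labels, proxy_predict):
    """Partition (example, label) pairs by whether the observed label
    matches the zero-shot proxy's prediction; disagreements are
    treated as potentially biased labels and set aside."""
    kept, flagged = [], []
    for x, y in zip(examples, labels):
        (kept if proxy_predict(x) == y else flagged).append((x, y))
    return kept, flagged

# Toy zero-shot proxy: label a number 1 if it is non-negative, else 0.
proxy = lambda x: int(x >= 0)

examples = [-2.0, -0.5, 1.0, 3.0]
labels   = [0,    1,    1,   0]   # the middle two labels are corrupted
kept, flagged = select_by_proxy(examples, labels, proxy)
# kept    -> [(-2.0, 0), (1.0, 1)]
# flagged -> [(-0.5, 1), (3.0, 0)]
```

Because the classifier is then trained only on `kept`, this scheme requires no change to the model architecture, matching the modality-agnostic, model-agnostic framing of the abstract.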