Many datasets are biased: they contain easy-to-learn features that are highly correlated with the target class in the dataset but not in the true underlying data distribution. For this reason, learning unbiased models from biased data has become an active research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that clarifies why recent contrastive losses (InfoNCE, SupCon, etc.) can fail on biased data. Building on it, we derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE) that provides more accurate control of the minimal distance between positive and negative samples. The same framework also leads us to propose FairKL, a new debiasing regularization loss that works well even with extremely biased data. We validate the proposed losses on standard vision datasets, including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL combined with epsilon-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases in the wild.
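To make the idea of a margin between positive and negative samples concrete, the following is a minimal sketch of a margin-based contrastive loss for a single anchor. It is an illustration under stated assumptions, not the formulation derived in the paper. We assume similarities are plain floats, and we take the loss for each positive to be a log-sum-exp smoothing of max(-epsilon, max_j (s_j^- - s^+)). The function name `margin_contrastive_loss` and its parameters are hypothetical.

```python
import math


def margin_contrastive_loss(pos_sims, neg_sims, epsilon=0.0):
    """Illustrative margin-based contrastive loss for one anchor.

    Assumption (not taken from the source): for each positive similarity
    s_pos, the loss is log(exp(-epsilon) + sum_j exp(s_neg_j - s_pos)),
    a smooth approximation of max(-epsilon, max_j (s_neg_j - s_pos)).
    It is small when every negative sits at least epsilon below s_pos.
    """
    total = 0.0
    for s_pos in pos_sims:
        # Gap terms: how far each negative is above (or below) the positive.
        gap_terms = sum(math.exp(s_neg - s_pos) for s_neg in neg_sims)
        # exp(-epsilon) acts as the floor that encodes the desired margin.
        total += math.log(math.exp(-epsilon) + gap_terms)
    return total


if __name__ == "__main__":
    # A well-separated positive yields a lower loss than a poorly separated one.
    print(margin_contrastive_loss([0.9], [0.1, 0.0], epsilon=0.25))
    print(margin_contrastive_loss([0.2], [0.1, 0.0], epsilon=0.25))
```

With `epsilon=0.0` this sketch reduces to an InfoNCE-style term for each positive. Increasing `epsilon` lowers the floor of the loss, so the loss keeps decreasing until the negatives are at least roughly `epsilon` below the positive.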