Many datasets are biased: they contain easy-to-learn features that are highly correlated with the target class in the dataset but not in the true underlying distribution of the data. For this reason, learning unbiased models from biased data has become an important research topic in recent years. In this work, we tackle the problem of learning representations that are robust to biases. We first present a margin-based theoretical framework that clarifies why recent contrastive losses (InfoNCE, SupCon, etc.) can fail on biased data. Building on this framework, we derive a novel formulation of the supervised contrastive loss (epsilon-SupInfoNCE) that provides more accurate control of the minimal distance between positive and negative samples. The same framework also leads us to propose FairKL, a new debiasing regularization loss that works well even with extremely biased data. We validate the proposed losses on standard vision datasets, including CIFAR10, CIFAR100, and ImageNet, and we assess the debiasing capability of FairKL combined with epsilon-SupInfoNCE, reaching state-of-the-art performance on a number of biased datasets, including real instances of biases in the wild.