Computer vision datasets frequently contain spurious correlations between task-relevant labels and easy-to-learn, latent, task-irrelevant attributes (e.g., context). Models trained on such datasets learn "shortcuts" and underperform on bias-conflicting slices of data where the correlation does not hold. In this work, we study the problem of identifying such slices to inform downstream bias mitigation strategies. We propose First Amplify Correlations and Then Slice to Discover Bias (FACTS), wherein we first amplify correlations to fit a simple bias-aligned hypothesis via strongly regularized empirical risk minimization. Next, we perform correlation-aware slicing via mixture modeling in the bias-aligned feature space to discover underperforming data slices that capture distinct correlations. Despite its simplicity, our method considerably improves over prior work (by as much as 35% in precision@10) in correlation bias identification across a range of diverse evaluation settings. Our code is available at: https://github.com/yvsriram/FACTS.
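To make the two stages concrete, the following is a minimal sketch on synthetic data, not the released implementation. Everything in it is an assumption made for illustration: the two-feature toy dataset, the helper names `fit_regularized_logreg` and `fit_gmm_1d`, the weight-decay value, and the use of the model's scalar logit as a stand-in for the bias-aligned feature space. A logistic regression trained with strong weight decay plays the role of the bias-aligned hypothesis, and a one-dimensional Gaussian mixture fit per class plays the role of correlation-aware slicing.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumed): label y, and a spurious attribute a that agrees with y 95% of the time.
n = 2000
y = rng.integers(0, 2, size=n)
a = np.where(rng.random(n) < 0.95, y, 1 - y)
core = 0.3 * (2 * y - 1) + rng.normal(0.0, 1.0, n)  # task-relevant but noisy
spur = 1.0 * (2 * a - 1) + rng.normal(0.0, 0.2, n)  # task-irrelevant but easy to learn
X = np.column_stack([core, spur])


def fit_regularized_logreg(X, y, weight_decay=1.0, lr=0.5, steps=300):
    """Gradient descent on mean log loss + (weight_decay / 2) * ||w||^2."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        w -= lr * (X.T @ (p - y) / len(y) + weight_decay * w)
        b -= lr * np.mean(p - y)
    return w, b


def fit_gmm_1d(z, k=2, iters=100):
    """EM for a 1-D Gaussian mixture; returns responsibilities of shape (n, k)."""
    mu = np.linspace(z.min(), z.max(), k)
    var = np.full(k, z.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step, computed in log space for numerical stability.
        log_p = np.log(pi) - 0.5 * (np.log(2 * np.pi * var) + (z[:, None] - mu) ** 2 / var)
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step.
        nk = resp.sum(axis=0) + 1e-12
        mu = resp.T @ z / nk
        var = np.maximum(resp.T @ z**2 / nk - mu**2, 1e-6)
        pi = nk / nk.sum()
    return resp


# Stage 1: amplify correlations with strongly regularized ERM.
w, b = fit_regularized_logreg(X, y)
logit = X @ w + b
pred = (logit > 0).astype(int)
overall_acc = float(np.mean(pred == y))

# Stage 2: slice one class (y = 1) with a mixture model on the model's logit.
mask = y == 1
resp = fit_gmm_1d(logit[mask])
mass = resp.sum(axis=0) + 1e-12
correct = (pred[mask] == 1).astype(float)
conflicting = (a[mask] == 0).astype(float)
slice_acc = resp.T @ correct / mass             # soft accuracy per slice
slice_conflict = resp.T @ conflicting / mass    # soft share of bias-conflicting samples
worst = int(np.argmin(slice_acc))

print("weights [core, spur]:", w)
print("overall accuracy:", overall_acc)
print("per-slice accuracy:", slice_acc)
print("bias-conflicting share of worst slice:", slice_conflict[worst])
```

In this toy setting the spurious attribute `a` is used only to audit the discovered slices; neither the classifier nor the mixture model sees it. Ranking slices by accuracy is one simple choice made here for illustration.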