The widespread success of deep learning models today is owed to the curation of large, complex datasets. However, such models frequently pick up biases inherent in the data during training, leading to unreliable predictions. Diagnosing and debiasing datasets is thus necessary to ensure reliable model performance. In this paper, we present ConBias, a novel framework for diagnosing and mitigating Concept co-occurrence Biases in visual datasets. ConBias represents a visual dataset as a knowledge graph of concepts, enabling fine-grained analysis of spurious concept co-occurrences to uncover concept imbalances across the whole dataset. Moreover, we show that a novel clique-based concept balancing strategy can mitigate these imbalances, leading to improved performance on downstream tasks. Extensive experiments show that data augmentation based on the balanced concept distribution produced by ConBias improves generalization performance across multiple datasets compared to state-of-the-art methods.
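To make the idea of diagnosing concept co-occurrence imbalance more concrete, the following is a minimal sketch and not the ConBias implementation. It assumes each image is annotated with a class label and a set of concept names, treats every small concept combination as a stand-in for a clique in the concept graph, and reports how many extra samples each class would need to match the best-represented class for that combination. The function names `clique_counts` and `balancing_deficits`, the `max_size` parameter, and the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def clique_counts(samples, max_size=2):
    """Count, per class, how often each concept combination co-occurs with it.

    `samples` is an iterable of (class_label, concepts) pairs. Combinations of
    up to `max_size` concepts approximate cliques in a concept graph
    (an assumption made for this sketch).
    """
    counts = {}
    for label, concepts in samples:
        per_class = counts.setdefault(label, Counter())
        ordered = sorted(set(concepts))  # canonical order so tuples are comparable
        for k in range(1, max_size + 1):
            for combo in combinations(ordered, k):
                per_class[combo] += 1
    return counts


def balancing_deficits(counts):
    """Return {(class_label, combo): extra samples needed to reach the max count}."""
    combos = set()
    for per_class in counts.values():
        combos.update(per_class)

    deficits = {}
    for combo in combos:
        # Balance every class up to the best-represented class for this combo.
        target = max(per_class[combo] for per_class in counts.values())
        for label, per_class in counts.items():
            gap = target - per_class[combo]
            if gap > 0:
                deficits[(label, combo)] = gap
    return deficits


# Toy annotations: "tree" co-occurs mostly with landbirds, "water" only with waterbirds.
samples = [
    ("landbird", {"tree", "grass"}),
    ("landbird", {"tree"}),
    ("landbird", {"tree", "sky"}),
    ("waterbird", {"water", "sky"}),
    ("waterbird", {"water"}),
    ("waterbird", {"tree"}),
]

counts = clique_counts(samples)
deficits = balancing_deficits(counts)

for (label, combo), gap in sorted(deficits.items()):
    print(f"{label} needs {gap} more sample(s) containing {combo}")
```

In this sketch, the resulting deficits would serve as a shopping list for augmentation: each entry names a class and a concept combination that is under-represented relative to other classes. How the missing images are actually generated is outside the scope of the example.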