Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions. However, we identify two dataset biases that models can exploit as shortcuts to answer various VL tasks correctly without proper understanding. The first is \emph{Unbalanced Matching} bias, where the correct answer overlaps with the question and image more than the incorrect answers do. The second is \emph{Distractor Similarity} bias, where incorrect answers are overly dissimilar to the correct answer but highly similar to the other incorrect answers within the same sample. To address these biases, we first propose Adversarial Data Synthesis (ADS) to generate synthetic training data and debiased evaluation data. We then introduce Intra-sample Counterfactual Training (ICT), which helps models make use of the synthesized training data, particularly the counterfactual data, by focusing on intra-sample differentiation. Extensive experiments demonstrate that ADS and ICT consistently improve model performance across different benchmarks, even in domain-shifted scenarios.
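To make the two biases concrete, the following minimal sketch probes a single multiple-choice sample for both of them. It is an illustration under stated assumptions, not the measurement procedure used in this work: it uses Jaccard overlap of lowercase word tokens as a stand-in for "overlap" and "similarity", it considers only the question text (the image is ignored), and the names `overlap` and `bias_scores` are hypothetical.

```python
import re
from itertools import combinations
from statistics import mean


def tokens(text):
    """Lowercase alphanumeric word tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Jaccard similarity between the token sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def bias_scores(question, options, correct_idx):
    """Text-only proxies for the two biases (assumption: image is ignored).

    unbalanced_matching_gap > 0: the correct answer overlaps the question
        more than the distractors do on average.
    distractor_similarity_gap > 0: distractors resemble each other more
        than they resemble the correct answer.
    """
    correct = options[correct_idx]
    distractors = [o for i, o in enumerate(options) if i != correct_idx]
    if len(distractors) < 2:
        raise ValueError("need at least two distractors")

    q_correct = overlap(question, correct)
    q_distractors = mean(overlap(question, d) for d in distractors)

    correct_vs_distractors = mean(overlap(correct, d) for d in distractors)
    among_distractors = mean(
        overlap(a, b) for a, b in combinations(distractors, 2)
    )

    return {
        "unbalanced_matching_gap": q_correct - q_distractors,
        "distractor_similarity_gap": among_distractors - correct_vs_distractors,
    }


# Hypothetical sample, invented for illustration only.
example = bias_scores(
    question="Why is person1 holding the umbrella?",
    options=[
        "person1 is holding the umbrella because it is raining",
        "the dog wants to eat food",
        "the dog wants to eat cake",
        "the cat wants to eat food",
    ],
    correct_idx=0,
)
```

In this invented sample both gaps are positive, so a model could pick the correct option by matching words against the question or by spotting the odd one out, without looking at the image.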