Vision-language (VL) understanding tasks evaluate models' comprehension of complex visual scenes through multiple-choice questions. However, we identify two dataset biases that models can exploit as shortcuts to answer VL questions correctly without genuine understanding. The first is \emph{Unbalanced Matching} bias, where the correct answer overlaps with the question and image more than the incorrect answers do. The second is \emph{Distractor Similarity} bias, where the incorrect answers are overly dissimilar to the correct answer yet highly similar to one another within the same sample. To address these biases, we first propose Adversarial Data Synthesis (ADS), which generates synthetic training data and debiased evaluation data. We then introduce Intra-sample Counterfactual Training (ICT), which helps models make use of the synthesized training data, particularly the counterfactual data, by focusing on intra-sample differentiation. Extensive experiments show that ADS and ICT consistently improve model performance across different benchmarks, even in domain-shifted scenarios.