Visual Commonsense Reasoning (VCR) is the task of answering questions about an image and providing explanations for those answers. While existing methods achieve high prediction accuracy, they often overlook biases in the datasets and lack debiasing strategies. Our analysis reveals co-occurrence and statistical biases in both the textual and visual data. We introduce the VCR-OOD datasets, comprising the VCR-OOD-QA and VCR-OOD-VA subsets, designed to evaluate the out-of-distribution generalization of models across the two modalities. Furthermore, we analyze the causal graph and the prediction shortcuts in VCR and adopt a backdoor adjustment method to remove the bias. Specifically, we build a dictionary from the set of correct answers to cut off the prediction shortcuts. Experiments demonstrate the effectiveness of our debiasing method across different datasets.
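For context, backdoor adjustment replaces the observational conditional P(Y | X) with an interventional one that blocks the confounding path. A minimal sketch, assuming the confounder is approximated by the dictionary built from the correct-answer set (the symbols X for the question-image input, Y for the answer, and z_i for dictionary entries are illustrative notation, not necessarily the paper's):

\[
P\big(Y \mid \mathrm{do}(X)\big) \;=\; \sum_{i=1}^{|Z|} P\big(Y \mid X, z_i\big)\, P(z_i),
\]

where \(Z = \{z_1, \dots, z_{|Z|}\}\) is the confounder dictionary and \(P(z_i)\) is its prior, often taken as uniform \(1/|Z|\) when no better estimate is available.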