Visual question answering (VQA) is a critical multimodal task in which an agent must answer questions based on visual cues. Unfortunately, language bias is a common problem in VQA: the model generates answers by relying only on associations with the question while ignoring the visual content, which leads to biased results. We tackle language bias by proposing a self-supervised counterfactual metric learning (SC-ML) method that focuses the model more effectively on image features. SC-ML adaptively selects the question-relevant visual features to answer the question, reducing the negative influence of question-irrelevant visual features on answer inference. In addition, the question-irrelevant visual features can be seamlessly incorporated into counterfactual training schemes to further boost robustness. Extensive experiments demonstrate the effectiveness of our method, with improved results on the VQA-CP dataset. Our code will be made publicly available.