Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering

Source: arXiv; IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2023. Extension of the CVPR'20 work. arXiv admin note: text overlap with arXiv:2003.06576

Today's VQA models still tend to capture superficial linguistic correlations in the training set and fail to generalize to test sets with different QA distributions. To reduce these language biases, recent VQA works introduce an auxiliary question-only model to regularize the training of the targeted VQA model, and achieve dominating performance on diagnostic benchmarks for out-of-distribution (OOD) testing. However, due to their complex model design, these ensemble-based methods lack two indispensable characteristics of an ideal VQA model: 1) Visual-explainable: the model should rely on the right visual regions when making decisions. 2) Question-sensitive: the model should be sensitive to linguistic variations in questions. To this end, we propose a novel model-agnostic Counterfactual Samples Synthesizing and Training (CSST) strategy. After training with CSST, VQA models are forced to focus on all critical objects and words, which significantly improves both their visual-explainable and question-sensitive abilities. Specifically, CSST is composed of two parts: Counterfactual Samples Synthesizing (CSS) and Counterfactual Samples Training (CST). CSS generates counterfactual samples by carefully masking critical objects in images or critical words in questions and assigning pseudo ground-truth answers. CST not only trains the VQA models with both complementary samples to predict their respective ground-truth answers, but also urges the models to further distinguish the original samples from superficially similar counterfactual ones. To facilitate CST training, we propose two variants of supervised contrastive loss for VQA, and design an effective positive and negative sample selection mechanism based on CSS. Extensive experiments demonstrate the effectiveness of CSST. In particular, by building on top of the LMH+SAR model, we achieve record-breaking performance on all OOD benchmarks.
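
As a rough illustration of the two parts described above, the sketch below masks the most critical question words to build a counterfactual question, assigns pseudo ground-truth answers, and contrasts the original sample against a positive and a negative. The abstract does not specify these details, so everything concrete here is an assumption: the per-word contribution scores, the `[MASK]` token, the cutoff `k`, the rule in `pseudo_answers`, and all function names are hypothetical. The loss is a generic InfoNCE-style objective with one positive and one negative, standing in for (not reproducing) the paper's two supervised contrastive variants.

```python
import math

MASK = "[MASK]"  # assumed placeholder token for masked words


def synthesize_question_pair(words, scores, k=1):
    """Build a (positive, counterfactual) question pair.

    Assumption: `scores[i]` is some contribution score of `words[i]` to the
    ground-truth answer (how it is computed is outside this sketch).
    The counterfactual question masks the top-k critical words; the positive
    question keeps only those words and masks everything else.
    """
    if len(words) != len(scores):
        raise ValueError("words and scores must have the same length")
    ranked = sorted(range(len(words)), key=lambda i: scores[i], reverse=True)
    critical = set(ranked[:k])
    counterfactual = [MASK if i in critical else w for i, w in enumerate(words)]
    positive = [w if i in critical else MASK for i, w in enumerate(words)]
    return positive, counterfactual


def pseudo_answers(gt_answers, predicted_from_positive):
    """Assumed rule: the counterfactual sample keeps only those ground-truth
    answers that the model does NOT predict from the positive sample."""
    predicted = set(predicted_from_positive)
    return [a for a in gt_answers if a not in predicted]


def contrastive_loss(anchor, positive, negative, temperature=0.5):
    """InfoNCE-style loss over cosine similarities (one positive, one negative).

    The loss is small when the anchor is closer to the positive than to the
    negative (e.g. original sample vs. its counterfactual).
    """
    def cosine(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = math.exp(cosine(anchor, negative) / temperature)
    return -math.log(pos / (pos + neg))


# Usage with made-up scores and toy 2-d features.
question = ["what", "color", "is", "the", "kite"]
contrib = [0.05, 0.30, 0.02, 0.03, 0.60]
q_pos, q_cf = synthesize_question_pair(question, contrib, k=1)
cf_answers = pseudo_answers(["green", "yellow"], ["green"])
loss = contrastive_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
```

In this toy run the counterfactual question hides "kite", the one word assumed to matter most, while the positive question keeps only that word. A real pipeline would take the scores from the VQA model itself and apply the same idea to image objects.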
