Natural language explanation for visual question answering (VQA-NLE) aims to explain a model's decision-making process by generating natural language sentences, thereby increasing users' trust in black-box systems. Existing post-hoc methods have made significant progress in producing plausible explanations. However, such post-hoc explanations are not always aligned with human logical inference and suffer from three issues: 1) deductive unsatisfiability, where the generated explanations do not logically lead to the answer; 2) factual inconsistency, where the model falsifies its counterfactual explanation for an answer without considering the facts in the image; and 3) semantic perturbation insensitivity, where the model cannot recognize the semantic changes caused by small perturbations. These problems reduce the faithfulness of the generated explanations. To address them, we propose a novel self-supervised \textbf{M}ulti-level \textbf{C}ontrastive \textbf{L}earning based natural language \textbf{E}xplanation model (MCLE) for VQA, which uses semantic-level, image-level, and instance-level factual and counterfactual samples. MCLE extracts discriminative features and aligns the feature space of explanations with that of the visual question and answer to generate more consistent explanations. We conduct extensive experiments, ablation analyses, and a case study on two VQA-NLE benchmarks to demonstrate the effectiveness of our method.