Benefiting from large-scale pretrained vision-language models (VLMs), Visual Question Answering (VQA) performance has approached human oracle performance. However, finetuning large-scale pretrained VLMs on limited data often leads to overfitting and poor generalization, and hence to a lack of model robustness. In this paper, we aim to improve input robustness, i.e., the ability of models to defend against visual and linguistic input variations as well as shortcut learning involved in the inputs, from the perspective of the Information Bottleneck when adapting pretrained VLMs to the downstream VQA task. Generally, internal representations obtained by pretrained VLMs inevitably contain information that is irrelevant or redundant for a specific downstream task, resulting in statistically spurious correlations and insensitivity to input variations. To encourage the obtained representations to converge to a minimal sufficient statistic in vision-language learning, we propose the Correlation Information Bottleneck (CIB) principle, which seeks a trade-off between representation compression and redundancy by minimizing the mutual information (MI) between the inputs and the internal representations while maximizing the MI between the outputs and the representations. Furthermore, CIB measures the internal correlations among visual and linguistic inputs and representations via a symmetrized joint MI estimation. Extensive experiments on five VQA datasets for input robustness demonstrate the effectiveness and superiority of the proposed CIB in terms of both robustness and accuracy.
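For orientation, the trade-off described above follows the pattern of the standard Information Bottleneck objective, which can be written as below. This is the generic formulation only, not the exact CIB objective, which the abstract does not spell out. The symbols are assumptions introduced here for illustration: $X$ denotes the inputs (in VQA, the visual and linguistic inputs together), $Y$ the outputs, $T$ the internal representation, and $\beta > 0$ a trade-off weight.

\[
\max_{p(t \mid x)} \; I(Y; T) \;-\; \beta \, I(X; T)
\]

Under this reading, a larger $\beta$ puts more weight on compressing input information out of $T$, while a smaller $\beta$ puts more weight on keeping $T$ predictive of $Y$.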