Various benchmarks have been proposed to test linguistic understanding in pre-trained vision & language (VL) models. Here we build on the existence task from the VALSE benchmark (Parcalabescu et al., 2022), which we use to test models' understanding of negation, a particularly interesting issue for multimodal models. However, while such VL benchmarks are useful for measuring model performance, they reveal nothing about the internal processes through which these models arrive at their outputs in such visio-linguistic tasks. We take inspiration from the growing literature on model interpretability to explain the behaviour of VL models on the understanding of negation. Specifically, we carry out an in-depth analysis of the text encoder in CLIP (Radford et al., 2021), a highly influential VL model. We localise parts of the encoder that process negation and analyse the role of attention heads in this task. Our contributions are threefold: we demonstrate how methods from the language model interpretability literature (such as causal tracing) can be translated to multimodal models and tasks; we provide concrete insights into how CLIP processes negation on the VALSE existence task; and we highlight inherent limitations in the VALSE dataset as a benchmark for linguistic understanding.