This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function, which is not only the major training objective but also the core deliverable that supports downstream tasks. Unlike the existing image-text similarity objective, which only categorizes matched pairs as similar and unmatched pairs as dissimilar, equivariance also requires similarity to vary faithfully with semantic changes. This allows VLMs to generalize better to nuanced and unseen multimodal compositions. However, modeling equivariance is challenging because the ground truth of semantic change is difficult to collect. For example, given an image-text pair about a dog, it is unclear to what extent the similarity should change when the pixels are changed from dog to cat. To this end, we propose EqSim, a regularization loss that can be efficiently calculated from any two matched training pairs and easily plugged into existing image-text retrieval fine-tuning. Meanwhile, to further diagnose the equivariance of VLMs, we present a new challenging benchmark, EqBen. Compared to existing evaluation sets, EqBen is the first to focus on "visual-minimal change". Extensive experiments show the lack of equivariance in current VLMs and validate the effectiveness of EqSim. Code is available at https://github.com/Wangt-CN/EqBen.