In this paper, we propose a new method to enhance compositional understanding in pre-trained vision and language models (VLMs) without sacrificing performance in zero-shot multi-modal tasks. Traditional fine-tuning approaches often improve compositional reasoning at the cost of degrading multi-modal capabilities, primarily due to the use of global hard negative (HN) loss, which contrasts global representations of images and texts. This global HN loss pushes away HN texts that are highly similar to the original texts, damaging the model's multi-modal representations. To overcome this limitation, we propose Fine-grained Selective Calibrated CLIP (FSC-CLIP), which integrates local hard negative loss and selective calibrated regularization. These innovations provide fine-grained negative supervision while preserving the model's representational integrity. Our extensive evaluations across diverse benchmarks for both compositionality and multi-modal tasks show that FSC-CLIP not only achieves compositionality on par with state-of-the-art models but also retains strong multi-modal capabilities. Code is available at: https://github.com/ytaek-oh/fsc-clip.
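To make the problematic baseline concrete, the following is a minimal sketch of the kind of global hard-negative contrastive loss the abstract describes: each image's hard-negative caption is appended as an extra negative in a CLIP-style cross-entropy, so the model is pushed to separate the image from a text that is nearly identical to the true caption. The function name, tensor shapes, and exact formulation are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def global_hn_loss(img_emb, txt_emb, hn_txt_emb, temperature=0.07):
    """Illustrative global hard-negative (HN) contrastive loss (assumed form).

    img_emb, txt_emb, hn_txt_emb: (B, D) global embeddings; for image i,
    txt_emb[i] is the positive caption and hn_txt_emb[i] is its HN rewrite.
    The HN text enters the softmax as one extra negative per image, so the
    loss pushes the image away from a text highly similar to the original.
    """
    img = F.normalize(img_emb, dim=-1)    # (B, D)
    txt = F.normalize(txt_emb, dim=-1)    # (B, D)
    hn = F.normalize(hn_txt_emb, dim=-1)  # (B, D)

    # Image-to-text logits over all in-batch texts ...
    logits = img @ txt.t() / temperature                            # (B, B)
    # ... plus each image's own hard-negative text as an extra column.
    hn_logit = (img * hn).sum(dim=-1, keepdim=True) / temperature   # (B, 1)
    logits = torch.cat([logits, hn_logit], dim=1)                   # (B, B+1)

    # The positive for image i is text i (the diagonal).
    targets = torch.arange(img.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)
```

Because the HN text is almost identical to the positive caption at the global-representation level, minimizing this loss forces large changes to the embedding space, which is the degradation of multi-modal representations that FSC-CLIP's local hard negative loss and selective calibrated regularization are designed to avoid.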