Contrastively-trained Vision-Language Models (VLMs) like CLIP have become the de facto approach for discriminative vision-language representation learning. However, these models have limited language understanding, often exhibiting "bag of words" behavior. At the same time, Large Vision-Language Models (LVLMs), which combine vision encoders with LLMs, have been shown to be capable of detailed vision-language reasoning, yet their autoregressive nature renders them less suitable for discriminative tasks. In this work, we propose to combine "the best of both worlds": a new training approach for discriminative fine-tuning of LVLMs that results in strong discriminative and compositional capabilities. Essentially, our approach converts a generative LVLM into a discriminative one, unlocking its capability for powerful image-text discrimination combined with enhanced language understanding. Our contributions include: (1) A carefully designed training/optimization framework that utilizes image-text pairs of variable length and granularity to train the model with both contrastive and next-token prediction losses. This is accompanied by ablation studies that justify the necessity of our framework's components. (2) A parameter-efficient adaptation method using a combination of soft prompting and LoRA adapters. (3) Significant improvements over state-of-the-art CLIP-like models of similar size, including on standard image-text retrieval benchmarks, with notable gains in compositionality.
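To make the dual-objective training in contribution (1) concrete, here is a minimal PyTorch sketch of one training step that combines a symmetric contrastive (InfoNCE) loss over pooled image/text embeddings with a next-token prediction loss on the caption tokens. All names (`training_step`, the temperature `tau`, the weighting `lambda_nt`) are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch only: combines a symmetric InfoNCE loss with next-token prediction.
# Variable names and the loss weighting are assumptions, not the paper's API.
import torch
import torch.nn.functional as F

def training_step(image_emb, text_emb, lm_logits, target_ids,
                  tau=0.07, lambda_nt=1.0):
    # L2-normalize the pooled image/text embeddings produced by the LVLM.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Cosine-similarity logits for all image-text pairs in the batch;
    # matched pairs lie on the diagonal.
    logits = image_emb @ text_emb.t() / tau
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric contrastive loss (image-to-text and text-to-image).
    contrastive = 0.5 * (F.cross_entropy(logits, targets) +
                         F.cross_entropy(logits.t(), targets))

    # Standard next-token prediction loss on the caption tokens,
    # shifting logits/targets by one position.
    next_token = F.cross_entropy(
        lm_logits[:, :-1].reshape(-1, lm_logits.size(-1)),
        target_ids[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return contrastive + lambda_nt * next_token
```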
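Similarly, a hand-rolled sketch of the parameter-efficient adaptation in contribution (2): trainable soft-prompt vectors prepended to the token embeddings, plus low-rank (LoRA) updates on an otherwise frozen linear projection. Class names, the rank, and the initializations are assumptions for illustration; in practice a library such as Hugging Face PEFT would typically supply the LoRA layers.

```python
# Sketch only: soft prompting + LoRA on a frozen backbone. Shapes, ranks,
# and initializations are illustrative assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # backbone weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.t() @ self.lora_b.t())

class SoftPrompt(nn.Module):
    """Trainable prompt vectors prepended to the token embeddings."""
    def __init__(self, n_tokens: int, dim: int):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_tokens, dim) * 0.02)

    def forward(self, token_embeds):  # token_embeds: (batch, seq, dim)
        batch = token_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, token_embeds], dim=1)
```

Under this recipe, only the soft-prompt vectors and the LoRA factors receive gradients, so the number of trained parameters is a small fraction of the full LVLM.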