Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing text tokens that are weakly correlated with, or even contradictory to, the input images. In this paper, we advocate assigning a distinct contribution to each text token based on its visual correlation. Specifically, we show that, by contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes the training of visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets. Importantly, our method incurs minimal additional computational overhead, rendering it highly efficient compared to alternative data scaling strategies. Code is available at https://github.com/foundation-multimodal-models/CAL.
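The core idea above can be sketched in a few lines. This is a minimal illustration, not the authors' released code: it assumes two forward passes, one conditioned on the image and one without it, and uses the per-token gap in ground-truth logits as the visual-correlation weight (the function names and the clipping choice are hypothetical simplifications):

```python
import numpy as np

def cal_token_weights(logits_with_image, logits_without_image, labels):
    """Contrastive re-weighting sketch: the gap in the ground-truth
    token's logit between the image-conditioned and image-free passes
    serves as a proxy for visual correlation. Shapes: logits (T, V),
    labels (T,)."""
    idx = np.arange(len(labels))
    gap = logits_with_image[idx, labels] - logits_without_image[idx, labels]
    # Tokens whose likelihood drops without the image are visually
    # correlated; clipping negatives down-weights tokens that the image
    # does not support (or contradicts).
    return np.clip(gap, 0.0, None)

def weighted_nll(logits, labels, weights):
    """Standard autoregressive NLL, re-weighted per token."""
    logp = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    idx = np.arange(len(labels))
    return -(weights * logp[idx, labels]).sum() / max(weights.sum(), 1e-8)
```

In a real VLM the two passes would share weights and the gap would be computed on the language-model head's logits; the normalization and clipping details here are placeholders for whatever scheme the paper actually uses.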