Existing image-text modality alignment in Vision Language Models (VLMs) treats each text token equally in an autoregressive manner. Despite being simple and effective, this method results in sub-optimal cross-modal alignment by over-emphasizing text tokens that are weakly correlated with, or even contradictory to, the input images. In this paper, we advocate assigning a distinct contribution to each text token based on its visual correlation. Specifically, we show that by contrasting image inputs, the difference in prediction logits on each text token provides strong guidance on its visual correlation. We therefore introduce Contrastive ALignment (CAL), a simple yet effective re-weighting strategy that prioritizes the training of visually correlated tokens. Our experimental results demonstrate that CAL consistently improves different types of VLMs across different resolutions and model sizes on various benchmark datasets. Importantly, our method incurs minimal additional computational overhead, rendering it highly efficient compared to alternative data scaling strategies. Code is available at https://github.com/foundation-multimodal-models/CAL.
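As a rough illustration of the re-weighting idea described above, the following PyTorch sketch derives per-token weights from the logit difference between an image-conditioned forward pass and a contrasted one (e.g., with the image dropped), then applies them to the token-level cross-entropy loss. The helper names `cal_token_weights` and `cal_loss`, and the exact mapping from logit differences to weights, are illustrative assumptions rather than the paper's precise formulation.

```python
import torch
import torch.nn.functional as F


def cal_token_weights(logits_with_image, logits_without_image, labels):
    """Sketch of CAL-style token re-weighting (hypothetical helper).

    logits_with_image:    [seq_len, vocab] next-token logits conditioned on the image
    logits_without_image: [seq_len, vocab] logits from the contrasted input
                          (e.g., the image removed or replaced)
    labels:               [seq_len] ground-truth text token ids
    """
    # Logit assigned to the ground-truth token under each condition.
    logit_img = logits_with_image.gather(1, labels.unsqueeze(1)).squeeze(1)
    logit_txt = logits_without_image.gather(1, labels.unsqueeze(1)).squeeze(1)

    # Contrastive difference: > 0 means the image helped predict this token,
    # so the token is treated as visually correlated.
    delta = logit_img - logit_txt

    # One possible transform from differences to non-negative, normalized
    # weights; the actual transform is a design choice (assumption here).
    weights = torch.clamp(delta, min=0.0)
    return weights / (weights.sum() + 1e-8) * len(weights)


def cal_loss(logits_with_image, logits_without_image, labels):
    """Token-weighted autoregressive loss using the contrastive weights."""
    weights = cal_token_weights(logits_with_image, logits_without_image, labels)
    per_token = F.cross_entropy(logits_with_image, labels, reduction="none")
    # Detach the weights so they guide, but do not receive, gradients.
    return (weights.detach() * per_token).mean()
```

In practice, the contrasted forward pass adds one extra inference per training step, which is consistent with the minimal overhead claimed above.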