Vision-Language Models (VLMs) have shown promising capabilities in handling various multimodal tasks, yet they struggle in long-context scenarios, particularly in tasks involving videos, high-resolution images, or lengthy image-text documents. In our work, we first conduct an empirical analysis of the long-context capabilities of VLMs using our augmented long-context multimodal datasets. Our findings reveal that directly applying the positional encoding mechanism used for textual tokens to visual tokens is suboptimal, and that VLM performance degrades sharply when the position encoding exceeds the model's context window. To address this, we propose Variable Visual Position Encoding (V2PE), a novel positional encoding approach that employs variable and smaller increments for visual tokens, enabling more efficient management of long multimodal sequences. Our experiments demonstrate the effectiveness of V2PE in enhancing VLMs' ability to understand and reason over long multimodal contexts. We further integrate V2PE with our augmented long-context multimodal datasets to fine-tune the open-source VLM InternVL2. The fine-tuned model achieves strong performance on both standard and long-context multimodal tasks. Notably, when the sequence length of the training dataset is increased to 256K tokens, the model is capable of processing multimodal sequences of up to 1M tokens, highlighting its potential for real-world long-context applications.
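The core idea of V2PE can be illustrated with a minimal sketch: text tokens advance the position index by the standard increment of 1, while visual tokens advance it by a smaller, adjustable increment, compressing the positional footprint of long visual sequences. The function name, token representation, and the example increment of 1/16 below are illustrative assumptions, not the paper's actual implementation.

```python
def assign_positions(tokens, visual_delta=1 / 16):
    """Assign a position index to each token.

    tokens: list of (modality, token_id) pairs, where modality is
    'text' or 'visual'. Text tokens advance the position by 1;
    visual tokens advance it by the smaller increment visual_delta
    (an assumed illustrative value), so a long run of visual tokens
    consumes far less of the model's positional range.
    """
    positions = []
    pos = 0.0
    for modality, _ in tokens:
        positions.append(pos)
        pos += 1.0 if modality == "text" else visual_delta
    return positions

# Example: one text token, three visual tokens, one text token.
seq = [("text", 0), ("visual", 1), ("visual", 2), ("visual", 3), ("text", 4)]
print(assign_positions(seq))  # [0.0, 1.0, 1.0625, 1.125, 1.1875]
```

With a standard encoding, the five tokens above would span positions 0 through 4; here the three visual tokens occupy less than a fifth of one positional step, which is why variable increments let far longer multimodal sequences fit inside the same context window.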