Large Vision-Language Models (LVLMs) achieve strong performance on single-image tasks, but their performance declines when multiple images are provided as input. One major cause is cross-image information leakage, where the model struggles to distinguish information across different images. Existing LVLMs already employ delimiter tokens to mark the start and end of each image, yet our analysis reveals that these tokens fail to effectively block cross-image information leakage. To enhance their effectiveness, we propose scaling the hidden states of the delimiter tokens. This reinforces intra-image interaction and limits undesired cross-image interaction, improving the model's ability to preserve image-specific information. As a result, the model distinguishes between images more reliably and reasons over them more accurately. Experiments show performance gains on multi-image benchmarks such as Mantis, MuirBench, MIRB, and QBench2. We further evaluate our method on text-only tasks that require clearly separating multiple inputs: it improves performance on multi-document and multi-table understanding benchmarks, including TQABench, MultiNews, and WCEP-10. Notably, our method requires no additional training and incurs no extra inference cost.
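The abstract does not give the exact formulation, but the core operation it describes can be sketched as follows: multiply the hidden-state vectors at delimiter-token positions by a scalar, leaving all other positions untouched. This is a minimal illustration, assuming the scaling is applied element-wise to the delimiter positions before subsequent attention layers; the function name, argument names, and default factor are hypothetical, not from the paper.

```python
import numpy as np

def scale_delimiter_hidden_states(hidden_states, delimiter_positions, alpha=2.0):
    """Scale hidden states at delimiter-token positions by a factor alpha.

    hidden_states: array of shape (seq_len, d_model)
    delimiter_positions: indices of image start/end delimiter tokens
    alpha: hypothetical scaling factor; values > 1 amplify the delimiters'
           contribution when later attention layers attend to these positions
    """
    out = hidden_states.copy()          # leave the original tensor untouched
    out[delimiter_positions] *= alpha   # scale only the delimiter positions
    return out

# Toy example: 6 tokens with hidden size 4; positions 0 and 5 act as the
# <img_start> / <img_end> delimiters of a single image span.
h = np.ones((6, 4))
scaled = scale_delimiter_hidden_states(h, [0, 5], alpha=2.0)
```

Since only existing hidden states are rescaled, no parameters are added, which is consistent with the abstract's claim of zero extra training or inference cost.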