Rapid progress in Multimodal Large Language Models (MLLMs) has significantly advanced their ability to process and understand complex visual and textual information. However, integrating multiple images and extensive textual context remains challenging, because these models have limited capacity to handle long input sequences efficiently. In this paper, we introduce SEEKER, a multimodal large language model designed to tackle this issue. SEEKER encodes long text compactly by rendering the text sequence into images, that is, by compressing it into the visual pixel space, so that the model can handle long text efficiently within a fixed token-length budget. Our experiments on six long-context multimodal tasks demonstrate that SEEKER needs fewer image tokens than the OCR-based approach to convey the same amount of textual information. SEEKER is also more efficient at understanding long-form multimodal input and generating long-form textual output, outperforming all existing proprietary and open-source MLLMs by large margins.
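The token-budget argument can be illustrated with back-of-envelope arithmetic. The sketch below is not SEEKER's implementation: the function names, image resolution, patch size, token-merge factor, glyph size, and the figure of roughly four characters per text token are all hypothetical assumptions chosen for illustration. It only counts tokens; whether a model can actually read text rendered this densely is an empirical question the sketch does not address.

```python
import math


def visual_tokens_for_text(num_chars, image_side=448, patch_side=14,
                           merge=2, char_width=6, line_height=10):
    """Estimate the image tokens needed to carry `num_chars` of rendered text.

    Every default is an illustrative assumption, not a setting from the paper.
    """
    # How many monospace glyphs fit on one square image.
    chars_per_line = image_side // char_width
    lines_per_image = image_side // line_height
    chars_per_image = chars_per_line * lines_per_image

    # Patches per side, reduced by an assumed merge x merge token pooling.
    tokens_per_side = image_side // patch_side // merge
    tokens_per_image = tokens_per_side ** 2

    num_images = math.ceil(num_chars / chars_per_image)
    return num_images * tokens_per_image


def text_tokens_for_text(num_chars, chars_per_token=4.0):
    """Estimate text tokens, assuming a fixed average characters-per-token."""
    return math.ceil(num_chars / chars_per_token)


if __name__ == "__main__":
    n = 10_000
    print("visual tokens:", visual_tokens_for_text(n))  # 1024 under these assumptions
    print("text tokens:  ", text_tokens_for_text(n))    # 2500 under these assumptions
```

The comparison is sensitive to the assumptions: with larger glyphs or no token merging, the rendered form can cost more tokens than plain text, so the benefit depends on how densely text can be rendered while remaining legible to the vision encoder.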