Text-rich images, in which text serves as the central visual element guiding overall understanding, are prevalent in real-world applications such as presentation slides, scanned documents, and webpage snapshots. Tasks involving multiple text-rich images are especially challenging, as they require not only understanding the content of individual images but also reasoning about inter-relationships and logical flows across multiple visual inputs. Despite the importance of these scenarios, current multimodal large language models (MLLMs) struggle to handle such tasks due to two key challenges: (1) the scarcity of high-quality instruction-tuning datasets for text-rich multi-image scenarios, and (2) the difficulty of balancing image resolution against visual feature sequence length. To address these challenges, we propose \OurMethod, an MLLM designed specifically for vision-language tasks involving multiple text-rich images. First, we curated approximately one million high-quality multimodal instruction-tuning samples tailored to text-rich, multi-image scenarios. Second, we developed an adaptive high-resolution multi-image encoding module that dynamically optimizes the allocation of visual sequence length based on the original aspect ratios and resolutions of the input images. Experiments across a wide range of benchmarks demonstrate our model's superior capabilities on text-rich, multi-image evaluations and competitive performance on general-domain evaluations.
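The abstract's adaptive encoding idea can be illustrated with a minimal sketch: given several input images and a fixed overall visual-token budget, distribute tokens across images in proportion to pixel area, then pick a patch grid per image that matches its aspect ratio. All names (`allocate_visual_tokens`, `patch`, `min_tokens`) and the allocation heuristic are hypothetical; the paper's actual module may differ.

```python
import math

def allocate_visual_tokens(image_sizes, total_budget, patch=28, min_tokens=64):
    """Illustrative sketch (not the paper's actual algorithm):
    split a fixed visual-token budget across multiple images in
    proportion to their pixel area, then choose a rows x cols patch
    grid per image that approximately preserves its aspect ratio."""
    areas = [w * h for w, h in image_sizes]
    total_area = sum(areas)
    plans = []
    for (w, h), area in zip(image_sizes, areas):
        # Proportional share of the budget, floored at a minimum so
        # small images still get a usable number of tokens.
        budget = max(min_tokens, int(total_budget * area / total_area))
        # cols / rows ~ aspect ratio, with rows * cols <= budget.
        aspect = w / h
        cols = max(1, round(math.sqrt(budget * aspect)))
        rows = max(1, budget // cols)
        plans.append({"grid": (rows, cols),
                      "resize_to": (cols * patch, rows * patch),
                      "tokens": rows * cols})
    return plans
```

For example, a 1920x1080 slide and an 800x600 scan sharing a 1024-token budget would receive token counts roughly proportional to their areas, each with a grid shaped to its own aspect ratio.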