Trained on vast amounts of data, multimodal large language models (MLLMs) have demonstrated strong general visual comprehension and achieved remarkable performance across a variety of tasks. However, their performance in visual document understanding still leaves much room for improvement. This gap arises primarily because visual document understanding is a fine-grained prediction task. For natural scenes, MLLMs typically use low-resolution images, which leads to a substantial loss of visual information. Furthermore, general-purpose MLLMs do not handle document-oriented instructions well. In this paper, we propose a High-Resolution Visual Document Assistant (HRVDA), which bridges the gap between MLLMs and visual document understanding. The model employs a content filtering mechanism and an instruction filtering module to filter out content-agnostic and instruction-agnostic visual tokens, respectively, thereby enabling efficient training and inference on high-resolution images. In addition, we construct a document-oriented visual instruction tuning dataset and apply a multi-stage training strategy to enhance the model's document modeling capabilities. Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple document understanding datasets, while maintaining training efficiency and inference speed comparable to low-resolution models.
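The two-stage token filtering described in the abstract can be illustrated with a minimal sketch. This is not the HRVDA implementation: the abstract does not specify how either filter scores tokens, so the sketch assumes that each visual token already has a content score in [0, 1], and it substitutes cosine similarity to a single instruction vector as a stand-in for instruction relevance. The function name `filter_tokens` and the parameters `content_threshold` and `keep_ratio` are hypothetical.

```python
import numpy as np


def filter_tokens(tokens, content_scores, instruction_vec,
                  content_threshold=0.5, keep_ratio=0.5):
    """Illustrative two-stage pruning of visual tokens.

    All names and scoring rules here are assumptions made for the sketch,
    not details taken from the paper.
    """
    tokens = np.asarray(tokens, dtype=float)               # shape (N, D)
    content_scores = np.asarray(content_scores, dtype=float)  # shape (N,)
    instruction_vec = np.asarray(instruction_vec, dtype=float)  # shape (D,)

    # Stage 1 (content filtering): drop tokens whose assumed content score
    # falls below the threshold, e.g. blank background patches.
    content_idx = np.flatnonzero(content_scores >= content_threshold)
    if content_idx.size == 0:
        return tokens[:0], content_idx

    kept = tokens[content_idx]

    # Stage 2 (instruction filtering): rank the surviving tokens by cosine
    # similarity to the instruction vector (an assumed relevance measure).
    eps = 1e-12  # guards against division by zero for all-zero vectors
    norms = np.linalg.norm(kept, axis=1) * np.linalg.norm(instruction_vec)
    sims = (kept @ instruction_vec) / (norms + eps)

    # Keep the top fraction of tokens, at least one.
    k = max(1, int(np.ceil(keep_ratio * content_idx.size)))
    top = np.argsort(-sims, kind="stable")[:k]

    # Restore the original spatial order of the retained tokens.
    final_idx = np.sort(content_idx[top])
    return tokens[final_idx], final_idx


if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
    toy_scores = [0.9, 0.8, 0.1, 0.7, 0.6]
    pruned, idx = filter_tokens(toy_tokens, toy_scores, [1.0, 0.0])
    print(idx, pruned.shape)
```

Because the language model's cost grows with the number of visual tokens it receives, pruning of this kind is what would let a high-resolution input be processed at a cost closer to that of a low-resolution one; the thresholds used above are arbitrary illustration values.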