We present a novel OCR-free document understanding framework based on pretrained Multimodal Large Language Models (MLLMs). Our approach employs multi-scale visual features to effectively handle the various font sizes that appear in document images. To address the increased computational cost of providing multi-scale visual inputs to MLLMs, we propose the Hierarchical Visual Feature Aggregation (HVFA) module, designed to reduce the number of input tokens to LLMs. Leveraging a feature pyramid with cross-attentive pooling, our approach effectively manages the trade-off between information loss and efficiency, regardless of varying document image sizes. Furthermore, we introduce a novel instruction tuning task, which strengthens the model's text-reading capability by training it to predict the relative positions of input text, ultimately reducing the risk of text truncation caused by the limited capacity of LLMs. Comprehensive experiments validate the effectiveness of our approach, demonstrating superior performance on various document understanding tasks.
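To make the cross-attentive pooling idea concrete, the following PyTorch sketch shows one plausible realization, not the paper's exact design. It assumes that pooled low-resolution features serve as attention queries over the full-resolution feature map, so the output token count is fixed by the pooling size rather than the input image resolution; the class name `CrossAttentivePooling` and parameters such as `pool_size` are hypothetical.

```python
# Hypothetical sketch of cross-attentive pooling for token reduction.
# Assumption (not stated in the abstract): pooled low-resolution features
# act as queries attending over the full-resolution feature map, so the
# output token budget is fixed regardless of document image size.
import torch
import torch.nn as nn

class CrossAttentivePooling(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, pool_size: int = 16):
        super().__init__()
        # Adaptive pooling fixes the number of query tokens to pool_size**2.
        self.pool = nn.AdaptiveAvgPool2d(pool_size)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        # feat: (B, C, H, W) high-resolution visual features.
        queries = self.pool(feat).flatten(2).transpose(1, 2)  # (B, P*P, C)
        keys = feat.flatten(2).transpose(1, 2)                # (B, H*W, C)
        # The fixed set of queries summarizes the full-resolution map,
        # trading a bounded token budget against information loss.
        pooled, _ = self.attn(queries, keys, keys)
        return self.norm(pooled + queries)                    # residual

if __name__ == "__main__":
    module = CrossAttentivePooling(dim=256)
    x = torch.randn(2, 256, 64, 48)  # arbitrary document feature resolution
    print(module(x).shape)           # torch.Size([2, 256, 256]), fixed budget
```

Applied independently to each level of the feature pyramid, a module of this form would cap the number of visual tokens passed to the LLM while letting the queries gather detail from the full-resolution features.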