We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks, including document question answering (DocVQA) and scene text analysis. Our approach introduces enhancements across several dimensions. First, by adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training. Second, we hypothesize that images may contain redundant tokens; by using similarity to identify and retain the significant tokens, we not only shorten the token sequence but also improve the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we enhance interpretability and reduce hallucinations. Additionally, TextMonkey can be finetuned to follow click commands on screenshots. Overall, our method notably boosts performance across various benchmark datasets, achieving increases of 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document-Oriented VQA, and key information extraction (KIE), respectively. In particular, it scores 561 on OCRBench, surpassing prior open-source large multimodal models for document understanding. Code will be released at https://github.com/Yuliang-Liu/Monkey.
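To make the token-filtering idea concrete, the following is a minimal sketch of one plausible reading of it, not the released implementation. It assumes that a token is "significant" when it is dissimilar to every other token, scored by its maximum cosine similarity to the rest, and that a fixed number of tokens is kept. The function names `cosine` and `select_distinct_tokens` and the parameter `keep` are hypothetical and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_distinct_tokens(tokens, keep):
    """Return the indices of the `keep` least redundant tokens, in original order.

    Assumption (not stated in the abstract): redundancy is measured as a
    token's maximum cosine similarity to any other token, and the tokens
    with the lowest such score are the ones retained.
    """
    n = len(tokens)
    if keep >= n:
        return list(range(n))
    scored = []
    for i in range(n):
        max_sim = max(cosine(tokens[i], tokens[j]) for j in range(n) if j != i)
        scored.append((max_sim, i))
    scored.sort()  # lowest maximum similarity first; ties broken by index
    return sorted(index for _, index in scored[:keep])


# Tokens 0 and 1 are exact duplicates, so both score 1.0 and are dropped first.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
kept = select_distinct_tokens(tokens, keep=2)
```

This pairwise loop is quadratic in the number of tokens and is meant only to illustrate the selection rule; a practical version would compute the similarity matrix in batched form on the visual token features.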