We present TextMonkey, a large multimodal model (LMM) tailored for text-centric tasks. Our approach introduces enhancements across several dimensions: by adopting Shifted Window Attention with zero-initialization, we achieve cross-window connectivity at higher input resolutions and stabilize early training; by hypothesizing that images may contain redundant tokens and using similarity to retain only the most significant ones, we not only streamline the token length but also enhance the model's performance. Moreover, by expanding our model's capabilities to encompass text spotting and grounding, and by incorporating positional information into responses, we improve interpretability. The model also learns to perform screenshot tasks through finetuning. Evaluation on 12 benchmarks shows notable improvements: 5.2% in Scene Text-Centric tasks (including STVQA, TextVQA, and OCRVQA), 6.9% in Document-Oriented tasks (such as DocVQA, InfoVQA, ChartQA, DeepForm, Kleister Charity, and WikiTableQuestions), and 2.8% in Key Information Extraction tasks (comprising FUNSD, SROIE, and POIE). TextMonkey also improves scene text spotting by 10.9% and sets a new standard on OCRBench, a comprehensive benchmark consisting of 29 OCR-related assessments, with a score of 561, surpassing previous open-source large multimodal models for document understanding. Code will be released at https://github.com/Yuliang-Liu/Monkey.
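To make the first enhancement concrete, the sketch below shows one plausible way a zero-initialized cross-window attention branch could be wired: the branch's output projection starts at zero so that, at initialization, the module reduces to an identity residual and early training is not destabilized. This is a minimal illustration assuming a standard residual design; the class and parameter names are ours, not the paper's, and the actual Shifted Window Attention module may differ.

```python
import torch
import torch.nn as nn

class ZeroInitWindowAttention(nn.Module):
    """Hypothetical cross-window attention branch with a zero-initialized
    output projection, so the branch contributes nothing at step 0 and the
    model starts from the pretrained backbone's behavior."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)  # zero-init: branch output is 0 initially
        nn.init.zeros_(self.proj.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) tokens gathered from one shifted window
        out, _ = self.attn(x, x, x)
        return x + self.proj(out)  # residual; an identity mapping at initialization
```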
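The second enhancement, similarity-based token filtering, can likewise be sketched. The following is a minimal illustration of one such scheme: tokens that are near-duplicates of other tokens (high maximum cosine similarity) are treated as redundant and dropped. The function name, the scoring rule, and the top-k selection are our assumptions for illustration; the paper's actual token filtering may use a different criterion.

```python
import torch
import torch.nn.functional as F

def filter_redundant_tokens(tokens: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` most distinctive tokens from an (N, D) token sequence.

    Illustrative only: each token is scored by its highest cosine similarity
    to any other token; the most redundant ones are discarded.
    """
    normed = F.normalize(tokens, dim=-1)      # unit-norm rows for cosine similarity
    sim = normed @ normed.T                   # (N, N) pairwise cosine similarity
    sim.fill_diagonal_(float("-inf"))         # ignore self-similarity
    redundancy = sim.max(dim=-1).values       # high = near-duplicate of another token
    keep_idx = redundancy.topk(keep, largest=False).indices
    return tokens[keep_idx.sort().values]     # preserve the original token order

# Example: reduce 1,024 visual tokens of width 768 down to 256
tokens = torch.randn(1024, 768)
kept = filter_redundant_tokens(tokens, keep=256)  # -> shape (256, 768)
```

Shortening the visual token sequence this way reduces the LMM's input length at high resolutions, which is the motivation the abstract gives for the filtering step.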