This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images at resolutions up to 2,560$\times$2,560. Existing methods either struggle with high-resolution documents or abandon the large language model, which constrains their vision or language ability. In contrast, DocPedia processes visual input directly in the frequency domain rather than in pixel space. This design enables DocPedia to capture more visual and textual information with a limited number of visual tokens. To improve both the perception and comprehension abilities of the model, we develop a dual-stage training strategy and enrich the instructions and annotations of all training tasks, which cover multiple document types. Extensive quantitative and qualitative experiments on publicly available benchmarks confirm the mutual benefits of jointly learning perception and comprehension tasks. The results further demonstrate the effectiveness of DocPedia and its superior performance over other methods.
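To make the idea of frequency-domain input concrete, the following is a minimal sketch, not the DocPedia implementation. The abstract does not say which transform is used; the sketch assumes a JPEG-style 8$\times$8 block discrete cosine transform (DCT) on a single-channel image, and the function names `dct_matrix`, `block_dct`, and `block_idct` are hypothetical. It shows how such a transform trades spatial resolution for channels without discarding information.

```python
import numpy as np


def dct_matrix(n: int = 8) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)     # DC row uses a different scale
    return mat


def block_dct(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Map an (H, W) image to an (H/block, W/block, block*block) grid.

    Assumption for illustration: non-overlapping blocks, one channel.
    """
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image sides must be multiples of the block size")
    d = dct_matrix(block)
    # Split into non-overlapping blocks: shape (H/b, W/b, b, b).
    blocks = image.reshape(h // block, block, w // block, block)
    blocks = blocks.transpose(0, 2, 1, 3)
    # 2-D DCT of every block: D @ X @ D.T (broadcast over the grid).
    coeffs = d @ blocks @ d.T
    return coeffs.reshape(h // block, w // block, block * block)


def block_idct(coeffs: np.ndarray, block: int = 8) -> np.ndarray:
    """Invert block_dct and recover the (H, W) image."""
    gh, gw, _ = coeffs.shape
    d = dct_matrix(block)
    # The basis is orthonormal, so the inverse is D.T @ C @ D.
    blocks = d.T @ coeffs.reshape(gh, gw, block, block) @ d
    return blocks.transpose(0, 2, 1, 3).reshape(gh * block, gw * block)
```

Under these assumptions, a 2,560$\times$2,560 input becomes a 320$\times$320 grid with 64 coefficients per cell, and `block_idct(block_dct(x))` recovers `x` up to floating-point error. The spatial grid is 64 times smaller while the content is preserved in the channel dimension, which is one way a frequency-domain representation can keep the number of downstream visual tokens small.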