This work presents DocPedia, a novel large multimodal model (LMM) for versatile OCR-free document understanding, capable of parsing images at resolutions up to 2,560$\times$2,560. Existing methods either struggle with high-resolution documents or forgo the large language model, which constrains their vision or language ability. In contrast, DocPedia processes visual input directly in the frequency domain rather than in pixel space. This characteristic enables DocPedia to capture more visual and textual information using a limited number of visual tokens. To enhance both the perception and comprehension abilities of our model, we develop a dual-stage training strategy and enrich the instructions and annotations of all training tasks, which cover multiple document types. Extensive quantitative and qualitative experiments on various publicly available benchmarks confirm the mutual benefits of jointly learning perception and comprehension tasks. The results provide further evidence of the effectiveness and superior performance of DocPedia over other methods.
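To illustrate why a frequency-domain input can cover a large image with a small spatial grid, the following is a minimal sketch, not the model's actual pipeline. It assumes a JPEG-style 8$\times$8 block-wise DCT as the frequency transform, which the abstract does not specify; the function names `dct_matrix` and `blockwise_dct` are hypothetical and introduced only for this example.

```python
import numpy as np


def dct_matrix(n: int = 8) -> np.ndarray:
    """Return the orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)  # frequency index (rows)
    i = np.arange(n).reshape(1, -1)  # sample index (columns)
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return m


def blockwise_dct(image: np.ndarray, block: int = 8) -> np.ndarray:
    """Map an (H, W) grayscale image to (H/block, W/block, block*block)
    DCT coefficients. Assumption: a JPEG-style block transform stands in
    for the unspecified frequency-domain representation."""
    h, w = image.shape
    if h % block or w % block:
        raise ValueError("image height and width must be multiples of block")
    d = dct_matrix(block)
    # Split into non-overlapping tiles: (H/b, b, W/b, b) -> (H/b, W/b, b, b).
    tiles = image.reshape(h // block, block, w // block, block)
    tiles = tiles.transpose(0, 2, 1, 3)
    # 2D DCT of every tile; matmul broadcasts over the two leading axes.
    coeffs = d @ tiles @ d.T
    return coeffs.reshape(h // block, w // block, block * block)
```

Under these assumptions, a 2,560$\times$2,560 grayscale input becomes a 320$\times$320 grid of 64-dimensional coefficient vectors. The spatial grid has 64 times fewer positions than the pixel grid, and because the transform is orthonormal, no information is discarded. How such coefficients are turned into visual tokens is beyond what this sketch covers.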