We introduce MonkeyOCR, a document parsing model that advances the state of the art by leveraging a Structure-Recognition-Relation (SRR) triplet paradigm. This design simplifies what would otherwise be a complex multi-tool pipeline and avoids the inefficiency of processing full pages with large end-to-end models. In SRR, document parsing is abstracted into three fundamental questions: ``Where is it?'' (structure), ``What is it?'' (recognition), and ``How is it organized?'' (relation), corresponding to structure detection, content recognition, and relation prediction. To support this paradigm, we present MonkeyDoc, a comprehensive dataset of 4.5 million bilingual instances spanning over ten document types, which addresses the limitations of existing datasets that often focus on a single task, language, or document type. Leveraging the SRR paradigm and MonkeyDoc, we train a 3B-parameter document foundation model. We further identify parameter redundancy in this model and propose contiguous parameter degradation (CPD), enabling the construction of models from 0.6B to 1.2B parameters that run faster with an acceptable performance drop. MonkeyOCR achieves state-of-the-art performance, surpassing previous open-source and closed-source methods, including Gemini 2.5-Pro. Additionally, the model can be efficiently deployed for inference on a single RTX 3090 GPU. Code and models will be released at https://github.com/Yuliang-Liu/MonkeyOCR.
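The SRR decomposition can be sketched as a three-stage pipeline. The sketch below is a minimal illustration only: the function names, the \texttt{Block} structure, and the stub outputs are assumptions for exposition, not MonkeyOCR's actual interface.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Block:
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) region on the page
    kind: str                        # e.g. "text", "table", "figure"
    content: str = ""                # filled in by the recognition stage

def detect_structure(page) -> List[Block]:
    """Stage 1 -- 'Where is it?': locate layout blocks on the page."""
    # Stub: a real detector would return boxes predicted by a layout model.
    return [Block((0, 0, 100, 20), "text"), Block((0, 30, 100, 80), "table")]

def recognize(block: Block) -> Block:
    """Stage 2 -- 'What is it?': recognize the content inside each block."""
    # Stub: a real recognizer would run OCR / table / formula recognition.
    block.content = f"<{block.kind} content>"
    return block

def predict_relations(blocks: List[Block]) -> List[Block]:
    """Stage 3 -- 'How is it organized?': arrange blocks in reading order."""
    # Stub: sort top-to-bottom by each box's top y-coordinate.
    return sorted(blocks, key=lambda b: b.bbox[1])

def parse_document(page) -> str:
    """Run the three SRR stages and emit the ordered document content."""
    blocks = [recognize(b) for b in detect_structure(page)]
    return "\n".join(b.content for b in predict_relations(blocks))
```

Because each stage operates on block-level regions rather than full pages, the recognition stage can process blocks independently, which is what lets SRR avoid the cost of page-level end-to-end decoding.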