Youtu-Parsing: Perception, Structuring and Recognition via High-Parallelism Decoding

Kun Yin, Yunfei Wu, Bing Liu, Zhongpeng Cai, Xiaotian Li, Huang Chen, Xin Li, Haoyu Cao, Yinsong Liu, Deqiang Jiang, Xing Sun, Yunsheng Wu, Qianyu Li, Antai Guo, Yanzhen Liao, Yanqiu Qu, Haodong Lin, Chengxu He, Shuangyin Liu

This paper presents Youtu-Parsing, an efficient and versatile document parsing model designed for high-performance content extraction. The architecture employs a native Vision Transformer (ViT) featuring a dynamic-resolution visual encoder to extract shared document features, coupled with a prompt-guided Youtu-LLM-2B language model for layout analysis and region-prompted decoding. Leveraging this decoupled and feature-reusable framework, we introduce a high-parallelism decoding strategy comprising two core components: token parallelism and query parallelism. The token parallelism strategy concurrently generates up to 64 candidate tokens per inference step, which are subsequently validated through a verification mechanism. This approach yields a 5x to 11x speedup over traditional autoregressive decoding and is particularly well-suited for highly structured scenarios, such as table recognition. To further exploit the advantages of region-prompted decoding, the query parallelism strategy enables simultaneous content prediction for multiple bounding boxes (up to five), providing an additional 2x acceleration while maintaining output quality equivalent to standard decoding. Youtu-Parsing encompasses a diverse range of document elements, including text, formulas, tables, charts, seals, and hierarchical structures. Furthermore, the model exhibits strong robustness when handling rare characters, multilingual text, and handwritten content. Extensive evaluations demonstrate that Youtu-Parsing achieves state-of-the-art (SOTA) performance on both the OmniDocBench and olmOCR-bench benchmarks. Overall, Youtu-Parsing demonstrates significant experimental value and practical utility for large-scale document intelligence applications.
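
The token-parallelism idea (propose a block of candidate tokens, then verify them) can be illustrated with a minimal toy sketch. The abstract does not specify the verification mechanism, so the rule used here (keep the longest candidate prefix that matches greedy decoding, plus one corrected token) is an assumption borrowed from common draft-and-verify schemes, not the authors' method. All names (`reference_next`, `noisy_draft`, `token_parallel_decode`) and the toy table sequence are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

EOS = "<eos>"
NextFn = Callable[[Sequence[str]], str]               # greedy next-token oracle
DraftFn = Callable[[Sequence[str], int], List[str]]   # proposes up to k candidates


def autoregressive_decode(next_token: NextFn, prompt: Sequence[str],
                          max_len: int = 256) -> Tuple[List[str], int]:
    """Baseline: one token per inference step."""
    out, steps = list(prompt), 0
    while len(out) < max_len:
        steps += 1
        tok = next_token(out)
        if tok == EOS:
            break
        out.append(tok)
    return out[len(prompt):], steps


def token_parallel_decode(next_token: NextFn, draft: DraftFn,
                          prompt: Sequence[str], block_size: int = 64,
                          max_len: int = 256) -> Tuple[List[str], int]:
    """Draft up to block_size candidates per step, then verify them.

    Assumed rule: candidates are accepted while they match the reference;
    at the first mismatch the reference token is kept and the rest of the
    block is discarded. A real model would verify the whole block in one
    forward pass; this toy calls the oracle in a loop and counts one step.
    """
    out, steps = list(prompt), 0
    while len(out) < max_len:
        steps += 1
        candidates = draft(out, block_size)
        for j in range(block_size):
            expected = next_token(out)
            if expected == EOS:
                return out[len(prompt):], steps
            out.append(expected)  # the verified token is always kept
            rejected = j >= len(candidates) or candidates[j] != expected
            if rejected or len(out) >= max_len:
                break
    return out[len(prompt):], steps


# Toy data (hypothetical): a highly regular table-like token sequence.
PROMPT = ["<table>"]
TARGET = ["<tr>", "<td>", "x", "</td>", "</tr>"] * 8  # 40 tokens


def reference_next(prefix: Sequence[str]) -> str:
    i = len(prefix) - len(PROMPT)
    return TARGET[i] if i < len(TARGET) else EOS


def noisy_draft(prefix: Sequence[str], k: int) -> List[str]:
    """Guesses the target but gets every 10th position wrong."""
    i = len(prefix) - len(PROMPT)
    return ["?" if (i + j) % 10 == 9 else t
            for j, t in enumerate(TARGET[i:i + k])]


auto_out, auto_steps = autoregressive_decode(reference_next, PROMPT)
par_out, par_steps = token_parallel_decode(reference_next, noisy_draft, PROMPT)
print(auto_steps, par_steps)  # prints: 41 5
```

In this toy, both decoders emit the same 40 tokens, and the step count drops because each step accepts a run of correct candidates. The ratio reflects only the toy drafter's artificial error rate and says nothing about the speedups reported for Youtu-Parsing.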
