Document parsing is a fundamental task in multimodal understanding, supporting a wide range of downstream applications such as information extraction and intelligent document analysis. Benefiting from strong semantic modeling and robust generalization, end-to-end approaches based on vision-language models (VLMs) have become the mainstream paradigm in recent years. However, these models often suffer from substantial inference latency, since parsing a long-form document requires auto-regressively generating a long token sequence. In this work, motivated by the extremely long outputs and complex layout structures typical of document parsing, we propose a training-free and highly efficient acceleration method. Inspired by speculative decoding, we employ a lightweight document parsing pipeline as a draft model to predict batches of future tokens, while the more accurate VLM verifies these draft predictions in parallel. We further exploit the layout-structured nature of documents by partitioning each page into independent regions, enabling parallel decoding of each region with the same draft-verify strategy; the final predictions are then assembled in natural reading order. Experiments demonstrate the effectiveness of our approach: on the general-purpose OmniDocBench, our method delivers a 2.42x lossless speedup for the dots.ocr model, and achieves up to 4.89x acceleration on long-document parsing tasks. We will release our code to facilitate reproducibility and future research.
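The draft-verify loop at the heart of speculative decoding can be illustrated with a minimal, self-contained sketch. Here the draft model and the verifier are toy stand-ins (the verifier simply knows a fixed target sequence, and the draft garbles every fifth token); all names are illustrative assumptions, not the paper's actual implementation. The key property shown is that the expensive model is invoked once per *batch* of drafted tokens rather than once per token:

```python
# Toy draft-verify loop for speculative decoding.
# TARGET plays the role of the "correct" output the accurate model
# (the VLM in the paper) would produce; the draft is a cheap but
# imperfect predictor. Hypothetical names throughout.
TARGET = list("document parsing made fast")

def verifier_next(prefix):
    # Stand-in for the accurate model: ground-truth next token.
    return TARGET[len(prefix)] if len(prefix) < len(TARGET) else None

def draft(prefix, k):
    # Cheap draft: mostly right, but wrong at every 5th position.
    start = len(prefix)
    return ['?' if i % 5 == 4 else TARGET[i]
            for i in range(start, min(start + k, len(TARGET)))]

def speculative_decode(k=4):
    out, passes = [], 0
    while len(out) < len(TARGET):
        proposal = draft(out, k)
        passes += 1                     # one parallel verification pass
        accepted = []
        for tok in proposal:            # accept the longest correct prefix
            if tok == verifier_next(out + accepted):
                accepted.append(tok)
            else:
                break
        if len(accepted) < len(proposal) or not proposal:
            # At the first mismatch, emit the verifier's own token,
            # so every pass makes at least one token of progress.
            nxt = verifier_next(out + accepted)
            if nxt is not None:
                accepted.append(nxt)
        out.extend(accepted)
    return ''.join(out), passes

text, passes = speculative_decode()
```

Because runs of correct draft tokens are accepted in bulk, the verifier runs far fewer passes than there are output tokens, while the final output matches plain auto-regressive decoding exactly — which is why the speedup is lossless.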
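The region-level parallelism can likewise be sketched in a few lines: each layout region is decoded independently, and the per-region outputs are stitched together in reading order. The `decode_region` stub and the region/field names below are illustrative assumptions standing in for the draft-verify decoding of one cropped region:

```python
# Hedged sketch of region-parallel page decoding, assuming the layout
# stage has already split the page into regions tagged with a
# reading-order index. Names are hypothetical.
from concurrent.futures import ThreadPoolExecutor

def decode_region(region):
    # Stand-in for running the draft-verify loop on one region crop.
    return f"[{region['type']}] {region['text']}"

def parse_page(regions):
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(decode_region, regions))  # decode in parallel
    # Assemble per-region results in natural reading order.
    ordered = sorted(zip((r["order"] for r in regions), outputs))
    return "\n".join(text for _, text in ordered)

page = [
    {"order": 1, "type": "para", "text": "Body text..."},
    {"order": 0, "type": "title", "text": "A Title"},
]
result = parse_page(page)
```

Since regions are decoded independently, this parallelism composes with the draft-verify acceleration above, and the final assembly step restores the page-level reading order.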