Key Information Extraction (KIE) from visually-rich documents (VrDs) is a critical task, for which recent Large Language Models (LLMs) and Multi-Modal Large Language Models (MLLMs) have demonstrated strong potential. However, their reliance on autoregressive inference, which generates outputs sequentially, creates a significant efficiency bottleneck, especially as KIE tasks often involve extracting multiple, semantically independent fields. To overcome this limitation, we introduce PIP: a Parallel Inference Paradigm for KIE. Our approach reformulates the problem by using "[mask]" tokens as placeholders for all target values, enabling their simultaneous generation in a single forward pass. To facilitate this paradigm, we develop a tailored mask pre-training strategy and construct large-scale supervised datasets. Experimental results show that our PIP models achieve a 5-36x inference speedup with negligible performance degradation compared to traditional autoregressive base models. By substantially improving efficiency while maintaining high accuracy, PIP paves the way for scalable and practical real-world KIE solutions.
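The efficiency argument above can be made concrete with a toy sketch (this is an illustration of the general idea, not the authors' implementation): a `fill` function stands in for one model forward pass, and we count passes needed to resolve three independent fields autoregressively versus all at once via "[mask]" placeholders. The field names and lookup values are invented for the example.

```python
# Hypothetical toy values standing in for a model's predictions.
FIELD_VALUES = {"name:": "ACME", "date:": "2024-01-01", "total:": "42.00"}

def fill(tokens, limit=None):
    """Stand-in for ONE model forward pass: predict values for up to
    `limit` [MASK] slots from a lookup keyed on the preceding field name."""
    out, filled = list(tokens), 0
    for i, t in enumerate(tokens):
        if t == "[MASK]" and (limit is None or filled < limit):
            out[i] = FIELD_VALUES.get(tokens[i - 1], "?")
            filled += 1
    return out

def autoregressive(tokens):
    """Sequential baseline: one forward pass per generated value."""
    passes = 0
    while "[MASK]" in tokens:
        tokens = fill(tokens, limit=1)  # each decoding step emits one value
        passes += 1
    return tokens, passes

def parallel(tokens):
    """PIP-style decoding: every placeholder resolved in a single pass."""
    return fill(tokens), 1

prompt = ["name:", "[MASK]", "date:", "[MASK]", "total:", "[MASK]"]
seq_out, seq_passes = autoregressive(prompt)
par_out, par_passes = parallel(prompt)
assert seq_out == par_out        # same extractions either way
print(seq_passes, par_passes)    # 3 forward passes vs 1
```

Because the three fields are semantically independent, filling them jointly loses nothing here; the speedup grows with the number of extracted fields, which is the regime the abstract's 5-36x figure refers to.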