We propose SeRum (SElective Region Understanding Model), a novel end-to-end document understanding model for extracting meaningful information from document images, with applications including document analysis, retrieval, and office automation. Unlike existing state-of-the-art approaches, which rely on multi-stage pipelines and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process over the visual tokens of interest, using a content-aware token merge module. This mechanism lets the model focus on the regions of interest generated by the query decoder, which improves its effectiveness and accelerates decoding in the generative scheme. We also design several pre-training tasks to enhance the model's understanding and local awareness. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial step toward efficient and effective end-to-end document understanding.
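To make the idea of decoding over only the visual tokens of interest more concrete, the following is a minimal sketch of one way a token merge step could look. It is not SeRum's actual module, whose details the abstract does not give. The function name `merge_tokens`, the `keep` parameter, and the use of non-negative per-token relevance scores (standing in for attention from the query decoder) are all assumptions made for illustration. The sketch keeps the highest-scoring tokens and collapses the rest into a single score-weighted background token, so a downstream decoder would attend over a much shorter sequence.

```python
from typing import List, Sequence


def merge_tokens(
    tokens: Sequence[Sequence[float]],
    scores: Sequence[float],
    keep: int,
) -> List[List[float]]:
    """Keep the `keep` highest-scoring tokens and merge the rest into one.

    Hypothetical illustration only: `scores` are assumed to be non-negative
    relevance values (for example, attention weights) with one entry per token.
    """
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    if not 0 < keep <= len(tokens):
        raise ValueError("keep must be between 1 and the number of tokens")

    # Rank token indices by relevance, highest first.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_idx = sorted(ranked[:keep])   # restore original spatial order
    rest_idx = sorted(ranked[keep:])

    kept = [list(tokens[i]) for i in kept_idx]
    if not rest_idx:
        return kept

    # Weight each remaining token by its share of the remaining score mass;
    # fall back to a plain mean if every remaining score is zero.
    total = sum(scores[i] for i in rest_idx)
    if total > 0:
        weights = [scores[i] / total for i in rest_idx]
    else:
        weights = [1.0 / len(rest_idx)] * len(rest_idx)

    dim = len(tokens[0])
    merged = [
        sum(w * tokens[i][d] for w, i in zip(weights, rest_idx))
        for d in range(dim)
    ]
    # Output length is keep + 1, regardless of the input length.
    return kept + [merged]
```

In this sketch the output sequence always has `keep + 1` tokens, which is the source of the decoding speed-up: the cost of cross-attention in a decoder scales with the number of visual tokens it has to attend over.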