In this study, we formulate an OCR-free sequence generation model for visual document understanding (VDU). Our model not only parses text from document images but also extracts the spatial coordinates of the text based on a multi-head architecture. Named the Coordinate-aware End-to-end Document Parser (CREPE), our method uniquely integrates these capabilities by introducing a special token for OCR text and token-triggered coordinate decoding. We also propose a weakly supervised framework for cost-efficient training that requires only parsing annotations, without high-cost coordinate annotations. Our experimental evaluations demonstrate CREPE's state-of-the-art performance on document parsing tasks. Beyond that, CREPE's adaptability is further highlighted by its successful application to other document understanding tasks, such as layout analysis and document visual question answering. CREPE's combined OCR and semantic parsing abilities not only mitigate the error propagation issues of existing OCR-dependent methods but also significantly enhance the functionality of sequence generation models, ushering in a new era for document understanding studies.
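The token-triggered coordinate decoding described above can be illustrated with a minimal sketch. This is not the authors' implementation: the token name `<ocr>`, the helper `coord_head`, and the decoding loop are all hypothetical stand-ins, showing only the control flow in which a special OCR token in the decoder's output stream triggers a separate head to emit a bounding box.

```python
# Hypothetical sketch (not CREPE's actual code): the decoder emits a
# token stream, and whenever the special OCR token appears, a separate
# coordinate head is invoked to predict a box for the associated text.

OCR_TOKEN = "<ocr>"  # assumed name for the special OCR text token


def coord_head(hidden_state):
    """Stand-in coordinate head: maps a decoder hidden state to a
    normalized (x, y, w, h) box. A real model would use a learned MLP
    over the transformer hidden state instead of this toy function."""
    x = (sum(hidden_state) % 100) / 100.0
    return (x, x, 0.1, 0.1)


def decode_with_coords(tokens, hidden_states):
    """Pair each OCR-triggering token with a predicted box; all other
    (semantic) tokens pass through without invoking the coordinate head,
    so coordinate decoding adds cost only where text is localized."""
    out = []
    for tok, h in zip(tokens, hidden_states):
        if tok == OCR_TOKEN:
            out.append((tok, coord_head(h)))  # coordinate decoding triggered
        else:
            out.append((tok, None))  # semantic token, no box emitted
    return out


result = decode_with_coords(
    [OCR_TOKEN, "Invoice", "<total>", OCR_TOKEN, "42.00"],
    [[1, 2], [3], [4], [5, 6], [7]],
)
```

The key design point this sketch mirrors is that localization is conditional on the token stream itself, which is what lets a single sequence generation model interleave parsing output with spatial coordinates.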