Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP), improving the state of the art on many existing tasks and exhibiting emergent capabilities. However, LLMs have not yet been successfully applied to semi-structured document information extraction, which is at the core of many document processing workflows and consists of extracting key entities from a visually rich document (VRD) given a predefined target schema. The main obstacles to LLM adoption in that task have been the absence of layout encoding within LLMs, which is critical for high-quality extraction, and the lack of a grounding mechanism ensuring the answer is not hallucinated. In this paper, we introduce Language Model-based Document Information Extraction and Localization (LMDX), a methodology to adapt arbitrary LLMs for document information extraction. LMDX can extract singular, repeated, and hierarchical entities, both with and without training data, while providing grounding guarantees and localizing the entities within the document. In particular, we apply LMDX to the PaLM 2-S LLM and evaluate it on the VRDU and CORD benchmarks, setting a new state of the art and showing how LMDX enables the creation of high-quality, data-efficient parsers.