Recently, many studies have demonstrated that incorporating only OCR-derived text and spatial layouts into large language models (LLMs) can be highly effective for document understanding tasks. However, existing methods that integrate spatial layouts with text have limitations, such as producing overly long text sequences or failing to fully exploit the autoregressive traits of LLMs. In this work, we introduce Interleaving Layout and Text in a Large Language Model (LayTextLLM) for document understanding. In particular, LayTextLLM projects each bounding box to a single embedding and interleaves it with text, avoiding long-sequence issues while leveraging the autoregressive traits of LLMs. LayTextLLM not only streamlines the interaction between layout and textual data but also improves performance on Key Information Extraction (KIE) and Visual Question Answering (VQA). Comprehensive benchmark evaluations reveal significant improvements, with a 27.0% increase on KIE tasks and 24.1% on VQA tasks compared to previous state-of-the-art document understanding MLLMs, as well as a 15.5% improvement over other SOTA OCR-based LLMs on KIE tasks.
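To make the core idea concrete, below is a minimal sketch of the box-to-embedding projection and the layout-text interleaving described above. The names `BoxProjector` and `interleave_layout_and_text`, the two-layer MLP projector, and the "one box embedding before each OCR text span" ordering are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class BoxProjector(nn.Module):
    """Projects a normalized bounding box (x1, y1, x2, y2) to a single
    embedding in the LLM's hidden space. Hypothetical sketch; the actual
    projector architecture in LayTextLLM may differ."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(4, hidden_size),
            nn.GELU(),
            nn.Linear(hidden_size, hidden_size),
        )

    def forward(self, boxes: torch.Tensor) -> torch.Tensor:
        # boxes: (num_boxes, 4) -> (num_boxes, hidden_size)
        return self.proj(boxes)


def interleave_layout_and_text(box_embeds, text_embeds_per_span):
    """Places one box embedding before the token embeddings of its OCR text
    span: [box_1, tok_11, ..., tok_1k, box_2, tok_21, ...]."""
    pieces = []
    for box_embed, span_embeds in zip(box_embeds, text_embeds_per_span):
        pieces.append(box_embed.unsqueeze(0))  # (1, hidden)
        pieces.append(span_embeds)             # (span_len, hidden)
    return torch.cat(pieces, dim=0)            # (total_len, hidden)


# Usage: 3 OCR spans, each with its own bounding box and token embeddings.
hidden = 4096
projector = BoxProjector(hidden)
boxes = torch.rand(3, 4)                        # normalized coordinates
box_embeds = projector(boxes)                   # (3, hidden)
text_embeds = [torch.randn(n, hidden) for n in (5, 2, 7)]
sequence = interleave_layout_and_text(box_embeds, text_embeds)
print(sequence.shape)                           # torch.Size([17, 4096])
```

Because each bounding box contributes exactly one embedding, the input grows by only one position per OCR span, which is how the method sidesteps the long-sequence problem of coordinate-as-text encodings.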