This paper proposes LayoutLLM, a flexible document analysis method for understanding document images. Visually Rich Document Understanding tasks, such as document image classification and information extraction, have attracted significant attention because of their importance. Existing methods improve document comprehension by incorporating awareness of images, text, and layout structure into pre-training. However, these methods require fine-tuning for each task and dataset, and the models are expensive to train and operate. To overcome this limitation, we propose LayoutLLM, which integrates such document understanding methods with large language models (LLMs). By combining the strengths of existing research in document image understanding with the superior language understanding capabilities of LLMs, the proposed model, fine-tuned on multimodal instruction datasets, performs document image understanding in a single model. Our experiments demonstrate improvements over the baseline model on various document analysis tasks.