We study the problem of completing various visual document understanding (VDU) tasks, e.g., question answering and information extraction, on real-world documents through human-written instructions. To this end, we propose InstructDoc, the first large-scale collection of 30 publicly available VDU datasets, each with diverse instructions in a unified format, which covers a wide range of 12 tasks and includes open document types/formats. Furthermore, to enhance the generalization performance on VDU tasks, we design a new instruction-based document reading and understanding model, InstructDr, that connects document images, image encoders, and large language models (LLMs) through a trainable bridging module. Experiments demonstrate that InstructDr can effectively adapt to new VDU datasets, tasks, and domains via given instructions and outperforms existing multimodal LLMs and ChatGPT without specific training.