Document Question Answering (QA) on visually rich documents (VRDs) remains challenging, particularly for documents dominated by lengthy text, such as research journal articles. Existing studies primarily focus on real-world documents with sparse text, and comprehending the hierarchical semantic relations across multiple pages to locate multimodal components remains an open challenge. To address this gap, we propose PDF-MVQA, which is tailored for research journal articles and covers multi-page, multimodal information retrieval. Unlike traditional machine reading comprehension (MRC) tasks, our approach aims to retrieve entire paragraphs containing answers, or visually rich document entities such as tables and figures. Our contributions include a comprehensive PDF document VQA dataset that allows the examination of semantically hierarchical layout structures in text-dominant documents. We also present new VRD-QA frameworks designed to capture textual content and the relations among document layouts simultaneously, extending page-level understanding to the entire multi-page document. Through this work, we aim to enhance the ability of existing vision-and-language models to handle text-dominant documents in VRD-QA.