Large Language Models (LLMs) struggle with document question answering (QA) when the document does not fit within the model's limited context length. To overcome this issue, most existing works focus on retrieving relevant context from the document and representing it as plain text. However, documents such as PDFs, web pages, and presentations are naturally structured with different pages, tables, sections, and so on. Representing such structured documents as plain text is incongruous with the user's mental model of these richly structured documents. When a system has to query the document for context, this incongruity is brought to the fore, and seemingly trivial questions can trip up the QA system. To bridge this fundamental gap in handling structured documents, we propose an approach called PDFTriage that enables models to retrieve context based on either structure or content. Our experiments demonstrate the effectiveness of the proposed PDFTriage-augmented models across several classes of questions where existing retrieval-augmented LLMs fail. To facilitate further research on this fundamental problem, we release our benchmark dataset consisting of 900+ human-generated questions over 80 structured documents, spanning 10 different categories of question types for document QA. Our code and datasets will be released soon on GitHub.