Researchers produce thousands of scholarly documents containing valuable technical knowledge. The community faces the laborious task of reading these documents to identify, extract, and synthesize information. To automate information gathering, document-level question answering (QA) offers a flexible framework in which human-posed questions can be adapted to extract diverse knowledge. Fine-tuning QA systems requires labeled data in the form of (context, question, answer) tuples. However, data curation for document QA is uniquely challenging because the context (i.e., the answer evidence passage) must be retrieved from potentially long, ill-formatted documents. Existing QA datasets sidestep this challenge by providing short, well-defined contexts that are unrealistic for real-world applications. We present a three-stage document QA approach: (1) text extraction from PDF; (2) evidence retrieval from the extracted text to form well-posed contexts; and (3) QA over those contexts to return high-quality answers, which may be extractive, abstractive, or Boolean. Using QASPER for evaluation, our detect-retrieve-comprehend (DRC) system achieves a +7.19 improvement in Answer-F1 over existing baselines while delivering superior context selection. Our results demonstrate that DRC holds tremendous promise as a flexible framework for practical scientific document QA.
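To make the retrieval stage and the Answer-F1 metric concrete, the following is a minimal sketch, not the DRC implementation. It assumes the PDF text has already been extracted into a list of paragraphs, uses simple token overlap as a stand-in for the evidence retriever, and scores answers with a SQuAD-style token-level F1, which we assume approximates Answer-F1 as reported on QASPER. The names `tokenize`, `retrieve`, and `token_f1` are hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    # Lowercase and keep alphanumeric runs; a deliberately simple normalizer.
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(question, paragraphs, top_k=1):
    # Hypothetical lexical retriever: rank paragraphs by the number of
    # distinct tokens they share with the question (stand-in for stage 2).
    q_tokens = set(tokenize(question))
    ranked = sorted(
        paragraphs,
        key=lambda p: len(q_tokens & set(tokenize(p))),
        reverse=True,
    )
    return ranked[:top_k]


def token_f1(prediction, reference):
    # SQuAD-style token-level F1 between a predicted and a reference answer.
    pred, ref = tokenize(prediction), tokenize(reference)
    if not pred or not ref:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    paragraphs = [
        "We evaluate on the QASPER dataset.",
        "The model uses a transformer encoder.",
    ]
    context = retrieve("Which dataset is used for evaluation?", paragraphs)
    print(context)
    print(token_f1("the QASPER dataset", "QASPER"))
```

In a full system, the comprehension stage would run a QA model over the retrieved context; the overlap retriever here only illustrates where context selection sits in the pipeline.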