Recently proposed long-form question answering (QA) systems, supported by large language models (LLMs), have shown promising capabilities. Yet, attributing and verifying their generated abstractive answers can be difficult, and automatically evaluating their accuracy remains an ongoing challenge. In this work, we introduce a new QA task for answering multi-answer questions by summarizing multiple diverse sources in a semi-extractive fashion. Specifically, Semi-extractive Multi-source QA (SEMQA) requires models to output a comprehensive answer that mixes factual quoted spans (copied verbatim from given input sources) with non-factual free-text connectors that glue these spans together into a single cohesive passage. This setting bridges the gap between the outputs of well-grounded but constrained extractive QA systems and more fluent but harder-to-attribute fully abstractive answers. In particular, it enables a new mode for language models that leverages their advanced language generation capabilities while also producing, by design, fine-grained in-line attributions that are easy to verify, interpret, and evaluate. To study this task, we create the first dataset of this kind, QuoteSum, with human-written semi-extractive answers to natural and generated questions, and define text-based evaluation metrics. Experimenting with several LLMs in various settings, we find this task to be surprisingly challenging, demonstrating the importance of QuoteSum for developing and studying such consolidation capabilities.
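To make the semi-extractive answer format concrete, the following is a minimal sketch of how an answer could be represented and checked. It assumes an answer is a list of segments, each either a quoted span tied to a source or a free-text connector. The `Segment` class, the `unsupported_quotes` and `render` helpers, the bracketed output markup, and the toy sources are all hypothetical illustrations, not the QuoteSum data format or the evaluation metrics defined in this work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One piece of a semi-extractive answer (hypothetical representation)."""
    text: str
    source_id: Optional[int] = None  # None marks a free-text connector


def unsupported_quotes(segments: List[Segment], sources: List[str]) -> List[Segment]:
    """Return quoted segments that are not verbatim substrings of their cited source."""
    bad = []
    for seg in segments:
        if seg.source_id is None:
            continue  # connectors carry no attribution, so there is nothing to check
        in_range = 0 <= seg.source_id < len(sources)
        if not in_range or seg.text not in sources[seg.source_id]:
            bad.append(seg)
    return bad


def render(segments: List[Segment]) -> str:
    """Join segments into one passage, marking quotes as [source number: span]."""
    parts = []
    for seg in segments:
        if seg.source_id is None:
            parts.append(seg.text)
        else:
            parts.append(f"[{seg.source_id + 1}: {seg.text}]")
    return "".join(parts)


# Toy inputs, invented for illustration only.
sources = [
    "The river floods in spring and in late autumn.",
    "Flooding is most severe after heavy monsoon rain.",
]
answer = [
    Segment("The river "),
    Segment("floods in spring", source_id=0),
    Segment(", and it is "),
    Segment("most severe after heavy monsoon rain", source_id=1),
    Segment("."),
]

print(render(answer))
print(unsupported_quotes(answer, sources))
```

Because every quoted span must appear verbatim in its cited source, attribution in this sketch reduces to a substring test, which illustrates why answers in this format are easy to verify automatically.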