Query-focused summarization (QFS) aims to extract or generate a summary of an input document that directly answers or is relevant to a given query. The lack of large-scale datasets in the form of documents, queries, and summaries has hindered model development in this area. In contrast, multiple large-scale high-quality datasets for generic summarization exist. We hypothesize that there is a hidden query for each summary sentence in a generic summarization annotation, and we utilize a large-scale pretrained language model to recover it. In this way, we convert four generic summarization benchmarks into a new QFS benchmark dataset, LMGQS, which consists of over 1 million document-query-summary samples. We thoroughly investigate the properties of our proposed dataset and establish baselines with state-of-the-art summarization models. By fine-tuning a language model on LMGQS, we achieve state-of-the-art zero-shot and supervised performance on multiple existing QFS benchmarks, demonstrating the high quality and diversity of LMGQS.
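The conversion described in the abstract, which pairs each sentence of a generic summary with a recovered hidden query, can be illustrated with a minimal sketch. It is not the authors' pipeline: `QFSSample`, `split_sentences`, `build_qfs_samples`, and `placeholder_query_generator` are hypothetical names, the regex sentence splitter is a simplification, and the query generator is a stub standing in for the large-scale pretrained language model, whose prompt and interface the abstract does not specify.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class QFSSample:
    """One document-query-summary triple (hypothetical container)."""
    document: str
    query: str
    summary: str


def split_sentences(text: str) -> List[str]:
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def build_qfs_samples(
    document: str,
    summary: str,
    generate_query: Callable[[str, str], str],
) -> List[QFSSample]:
    """Turn one generic (document, summary) pair into QFS triples.

    Each summary sentence is treated as the answer to a hidden query,
    which `generate_query` is asked to recover.
    """
    samples = []
    for sentence in split_sentences(summary):
        query = generate_query(document, sentence)
        samples.append(QFSSample(document, query, sentence))
    return samples


def placeholder_query_generator(document: str, sentence: str) -> str:
    # Stub standing in for the language model; a real system would
    # prompt a pretrained model with the document and the sentence.
    return "What does the document say about: " + sentence.rstrip(".!?") + "?"
```

Passing the generator in as a callable keeps the sketch independent of any particular model API, so the stub can be swapped for a real model call without changing the conversion loop.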