We present NewsQs (news-cues), a dataset that provides question-answer pairs for multiple news documents. To create NewsQs, we augment a traditional multi-document summarization dataset with questions automatically generated by a T5-Large model fine-tuned on FAQ-style news articles from the News On the Web corpus. We show through human evaluation that fine-tuning the model with control codes produces questions that are judged acceptable more often than those from the same model fine-tuned without them. To filter our data, we use a QNLI model whose judgments correlate highly with human annotations. We release our final dataset of high-quality questions, answers, and document clusters as a resource for future work in query-based multi-document summarization.