In this work, we propose a method for incorporating question-answering (QA) signals into a summarization model. Our method identifies salient noun phrases (NPs) in the input document by automatically generating wh-questions that the NPs answer and then automatically determining whether those questions are answered in the gold summaries. This QA-based signal is incorporated into a two-stage summarization model that first marks salient NPs in the input document using a classification model and then generates a summary conditioned on the marked document. Our experiments on benchmark summarization datasets demonstrate that models trained with QA-based supervision generate higher-quality summaries than models that rely on baseline methods of identifying salient spans. Further, we show that the content of the generated summaries can be controlled by changing which NPs are marked in the input document. Finally, we propose a method of augmenting the training data so that the gold summaries are more consistent with the marked input spans used during training, and we show that this yields models that learn to better exclude unmarked document content.
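To make the labeling and marking steps concrete, the following is a minimal sketch and not the implementation used in this work. It assumes that NPs are given as character offsets and that question generation and answer checking are supplied as callables. The names `label_salient_nps`, `mark_spans`, `generate_question`, `is_answered`, and the `<np>`/`</np>` marker tokens are all hypothetical and chosen only for illustration.

```python
from typing import Callable, List, Tuple

# Character offsets [start, end) of a noun phrase in the document.
Span = Tuple[int, int]


def label_salient_nps(
    document: str,
    nps: List[Span],
    summary: str,
    generate_question: Callable[[str, Span], str],  # hypothetical QG model
    is_answered: Callable[[str, str], bool],        # hypothetical QA check
) -> List[Span]:
    """Keep the NPs whose generated question is answered by the gold summary."""
    salient = []
    for span in nps:
        question = generate_question(document, span)
        if is_answered(question, summary):
            salient.append(span)
    return salient


def mark_spans(
    document: str,
    spans: List[Span],
    open_tok: str = "<np>",    # assumed marker token, not from the paper
    close_tok: str = "</np>",  # assumed marker token, not from the paper
) -> str:
    """Wrap each span in marker tokens. Spans are assumed not to overlap."""
    pieces = []
    cursor = 0
    for start, end in sorted(spans):
        pieces.append(document[cursor:start])
        pieces.append(f"{open_tok} {document[start:end]} {close_tok}")
        cursor = end
    pieces.append(document[cursor:])
    return "".join(pieces)
```

In this sketch, the output of `label_salient_nps` would serve as supervision for the first-stage classifier, and the output of `mark_spans` would be the input to the second-stage generator. Under these assumptions, passing a different set of spans to `mark_spans` at inference time is how the summary content would be steered.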