Argument summarisation is a promising but under-explored field. Recent work has aimed to provide textual summaries in the form of concise, salient short texts, i.e., key points (KPs), in a task known as Key Point Analysis (KPA). One of the main challenges in KPA is finding high-quality key point candidates among dozens of arguments, even in a small corpus. Evaluating key points is equally important for ensuring that automatically generated summaries are useful. Although automatic methods for evaluating summarisation have advanced considerably over the years, they mainly focus on sentence-level comparison, which makes it difficult to measure the quality of a summary (a set of KPs) as a whole. Compounding this problem, human evaluation is costly and difficult to reproduce. To address these issues, we propose a two-step abstractive summarisation framework based on neural topic modelling with an iterative clustering procedure, which generates key points that align with how humans identify key points. Our experiments show that the framework advances the state of the art in KPA, with performance improvements of up to 14 (absolute) percentage points in terms of both ROUGE and our proposed evaluation metrics. We further evaluate the generated summaries using a novel set-based evaluation toolkit. Our quantitative analysis demonstrates the effectiveness of the proposed evaluation metrics in assessing the quality of generated KPs. Human evaluation further demonstrates the advantages of our approach and confirms that the proposed evaluation metrics are more consistent with human judgment than ROUGE scores.
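To make the contrast between sentence-level and set-level comparison concrete, the following is a minimal sketch of one way a set of generated key points could be scored against a set of reference key points. It is an illustration under stated assumptions, not the evaluation toolkit described in the abstract: the function names `token_f1` and `soft_set_scores` are hypothetical, and the unigram-overlap similarity is a simple stand-in for whatever sentence-level similarity one prefers.

```python
from typing import Callable, List, Tuple


def token_f1(a: str, b: str) -> float:
    """Unigram-overlap F1 between two short texts.

    Assumption: a stand-in similarity for illustration only; any
    sentence-level similarity with values in [0, 1] could be used.
    """
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    overlap = len(tokens_a & tokens_b)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(tokens_a), overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def soft_set_scores(
    generated: List[str],
    reference: List[str],
    sim: Callable[[str, str], float] = token_f1,
) -> Tuple[float, float, float]:
    """Score a whole set of key points against a reference set.

    Precision: each generated key point is matched to its most similar
    reference key point, and the best-match scores are averaged.
    Recall: the same, starting from each reference key point.
    Returns (precision, recall, f1). Hypothetical helper, not the
    paper's metric.
    """
    if not generated or not reference:
        return 0.0, 0.0, 0.0
    precision = sum(max(sim(g, r) for r in reference) for g in generated) / len(generated)
    recall = sum(max(sim(r, g) for g in generated) for r in reference) / len(reference)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

Under this sketch, a generated set that covers only some of the reference key points keeps a high precision but loses recall, which a purely pairwise sentence-level score would not reveal.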