Embrace Divergence for Richer Insights: A Multi-document Summarization Benchmark and a Case Study on Summarizing Diverse Information from News Articles

September 17, 2023

Kung-Hsiang Huang, Philippe Laban, Alexander R. Fabbri, Prafulla Kumar Choubey, Shafiq Joty, Caiming Xiong, Chien-Sheng Wu

Previous research in multi-document news summarization has typically concentrated on collating information that all sources agree upon. However, to our knowledge, the summarization of diverse information dispersed across multiple articles about an event has not been previously investigated. This setting poses a different set of challenges for a summarization model. In this paper, we propose a new task of summarizing diverse information encountered in multiple news articles covering the same event. To facilitate this task, we outline a data collection schema for identifying diverse information and curate a dataset named DiverseSumm. The dataset includes 245 news stories, with each story comprising 10 news articles and paired with a human-validated reference. Moreover, we conduct a comprehensive analysis to pinpoint the position and verbosity biases that arise when utilizing Large Language Model (LLM)-based metrics for evaluating the coverage and faithfulness of the summaries, as well as the correlation of these metrics with human assessments. We apply our findings to study how LLMs summarize multiple news articles by analyzing which types of diverse information LLMs are capable of identifying. Our analyses suggest that despite the extraordinary capabilities of LLMs in single-document summarization, the proposed task remains a complex challenge for them, mainly due to their limited coverage: GPT-4 covers less than 40% of the diverse information on average.
