Compared with news and chat summarization, progress in meeting summarization has been severely slowed by limited data. To address this, we introduce VCSum, a versatile Chinese meeting summarization dataset consisting of 239 real-life meetings with a total duration of over 230 hours. We call the dataset versatile because, for each meeting transcript, we provide annotations of topic segmentation, headlines, segmentation summaries, overall meeting summaries, and salient sentences. As such, the dataset can be adapted to various summarization tasks and methods, including segmentation-based summarization, multi-granularity summarization, and retrieval-then-generate summarization. Our analysis confirms the effectiveness and robustness of VCSum. We also provide a set of benchmark models for different downstream summarization tasks on VCSum to facilitate further research. The dataset and code will be released at https://github.com/hahahawu/VCSum.
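To make the five annotation types concrete, the following is a minimal sketch of how one annotated meeting might be held in memory and turned into inputs for the three task families named in the abstract. It is not the released VCSum file format: the class names `Segment` and `Meeting`, every field name, and the helper methods are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Union


@dataclass
class Segment:
    """One topic segment of a meeting transcript (hypothetical layout)."""
    sentences: List[str]   # transcript sentences belonging to this segment
    headline: str          # short headline annotated for the segment
    summary: str           # segment-level summary
    salient: List[int] = field(default_factory=list)  # indices into `sentences`


@dataclass
class Meeting:
    """A meeting carrying the five annotation types listed in the abstract."""
    segments: List[Segment]  # the topic segmentation of the transcript
    overall_summary: str     # summary of the whole meeting

    def segmentation_pairs(self) -> List[Tuple[str, str]]:
        """(segment text, segment summary) pairs for segmentation-based summarization."""
        return [(" ".join(s.sentences), s.summary) for s in self.segments]

    def salient_sentences(self) -> List[str]:
        """Salient sentences in transcript order, usable as retrieval targets."""
        return [s.sentences[i] for s in self.segments for i in s.salient]

    def multi_granularity_targets(self) -> Dict[str, Union[str, List[str]]]:
        """Targets from finest to coarsest granularity."""
        return {
            "headline": [s.headline for s in self.segments],
            "segment": [s.summary for s in self.segments],
            "meeting": self.overall_summary,
        }
```

Under these assumptions, a retrieval-then-generate pipeline would train a retriever against `salient_sentences()` and a generator against `overall_summary`, while a segmentation-based pipeline would consume `segmentation_pairs()` directly. Keeping all annotations on one object reflects the abstract's point that a single transcript can serve several task setups.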