Timeline summarization (TLS) condenses a long-running event into a sequence of dated summaries drawn from numerous news articles. However, limited data availability has significantly slowed progress on this task. In this paper, we introduce CNTLS, a versatile resource for Chinese timeline summarization. CNTLS encompasses 77 real-life topics, each with 2524 documents, and its timelines achieve nearly 60\% compression in day-level duration on average across all topics. We analyze the corpus using well-known metrics, focusing on the style of the summaries and the complexity of the summarization task. We also evaluate various extractive and generative summarization systems on CNTLS to provide benchmarks and support further research. To the best of our knowledge, CNTLS is the first Chinese timeline summarization dataset. The dataset and source code are publicly released.\footnote{Code and data available at \emph{\url{https://github.com/OpenSUM/CNTLS}}.}