In the rapidly evolving digital era, demand for concise information is growing as individuals seek to distil key insights from many sources. Recent research attention on Multi-document Summarisation (MDS) has produced diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles. However, these datasets are English-centric, which leaves a conspicuous gap for multilingual datasets in today's globalised digital landscape, where linguistic diversity is celebrated. Media platforms such as the British Broadcasting Corporation (BBC) have disseminated news in 20+ languages for decades. With only 380 million native English speakers, accounting for less than 5% of the global population, the vast majority of people rely primarily on other languages. These facts underscore the need for inclusivity in MDS research, utilising resources from diverse languages. Recognising this gap, we present the Multilingual Dataset for Multi-document Summarisation (M2DS), which, to the best of our knowledge, is the first dataset of its kind. It includes document-summary pairs in five languages, drawn from BBC articles published between 2010 and 2023. This paper introduces M2DS, emphasising its unique multilingual aspect, and reports baseline scores from state-of-the-art MDS models evaluated on our dataset.