M2DS: Multilingual Dataset for Multi-document Summarisation

In the rapidly evolving digital era, there is an increasing demand for concise information as individuals seek to distil key insights from various sources. Recent research attention on Multi-document Summarisation (MDS) has produced diverse datasets covering customer reviews, academic papers, medical and legal documents, and news articles. However, the English-centric nature of these datasets has left a conspicuous gap for multilingual datasets in today's globalised digital landscape, where linguistic diversity is celebrated. Media platforms such as the British Broadcasting Corporation (BBC) have disseminated news in more than 20 languages for decades. With only 380 million people speaking English as their first language, accounting for less than 5% of the global population, the vast majority relies primarily on other languages. These facts underscore the need for inclusivity in MDS research that draws on resources from diverse languages. Recognising this gap, we present the Multilingual Dataset for Multi-document Summarisation (M2DS), which, to the best of our knowledge, is the first dataset of its kind. It includes document-summary pairs in five languages from BBC articles published between 2010 and 2023. This paper introduces M2DS, emphasising its unique multilingual aspect, and reports baseline scores from state-of-the-art MDS models evaluated on our dataset.


