While natural language processing tools have been developed extensively for some of the world's languages, a significant portion of the world's more than 7,000 languages is still neglected. One reason is that evaluation datasets do not yet cover a wide range of languages, including low-resource and endangered ones. We aim to address this issue by creating a text classification dataset encompassing a large number of languages, many of which currently have little to no annotated data available. We leverage parallel translations of the Bible to construct such a dataset: we first develop applicable topics and then employ a crowdsourcing tool to collect annotated data. By annotating the English side of the data and projecting the labels onto other languages through aligned verses, we generate text classification datasets for more than 1,500 languages. We extensively benchmark several existing multilingual language models using our dataset. To facilitate the advancement of research in this area, we will release our dataset and code.
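The label projection step can be illustrated with a minimal sketch. It assumes that each verse carries an identifier shared across translations and that English annotations map those identifiers to topic labels. The function name `project_labels`, the identifier format, and the example labels are hypothetical and are not taken from the released code.

```python
from typing import Dict, Tuple


def project_labels(
    english_labels: Dict[str, str],
    target_verses: Dict[str, str],
) -> Dict[str, Tuple[str, str]]:
    """Copy each English verse's topic label to the aligned target-language verse.

    english_labels: verse ID -> topic label annotated on the English side.
    target_verses:  verse ID -> verse text in the target language.
    Verses without an English annotation are skipped.
    """
    projected: Dict[str, Tuple[str, str]] = {}
    for verse_id, text in target_verses.items():
        label = english_labels.get(verse_id)
        if label is not None:
            projected[verse_id] = (text, label)
    return projected


# Hypothetical verse IDs, labels, and placeholder texts, for illustration only.
english_labels = {"GEN.1.1": "topic_a", "JHN.3.16": "topic_b"}
target_verses = {
    "GEN.1.1": "<target-language text of GEN.1.1>",
    "PSA.23.1": "<target-language text of PSA.23.1>",
}
dataset = project_labels(english_labels, target_verses)
```

In this sketch only verses present in both the annotated English set and the target translation end up in the projected dataset, so the number of labeled examples can differ from one language to another.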