Large-scale corpora play a vital role in the construction of large language models (LLMs). However, existing LLMs show limited ability to understand low-resource languages, including the minority languages of China, because of a lack of training data. To improve the accessibility of these languages, we present MC^2, a Multilingual Corpus of Minority Languages in China, which is the largest open-source corpus of its kind to date. It covers four underrepresented languages: Tibetan, Uyghur, Kazakh in the Kazakh Arabic script, and Mongolian in the traditional Mongolian script. Notably, two of the writing systems in MC^2 have long been neglected in previous corpora. Having identified serious contamination in the low-resource language splits of existing multilingual corpora, we propose a quality-centric solution for collecting MC^2 that prioritizes quality and accuracy while enhancing representativeness and diversity. Through in-depth analysis, we demonstrate the new research challenges that MC^2 brings, such as long-text modeling and the multiplicity of writing systems. We hope MC^2 can help promote equity for the underrepresented languages of China and provide a reliable data foundation for further research on low-resource languages.