CulturaX: A Cleaned, Enormous, and Multilingual Dataset for Large Language Models in 167 Languages

The driving factors behind the development of large language models (LLMs) with impressive learning capabilities are their colossal model sizes and extensive training datasets. Along with the progress in natural language processing, LLMs have frequently been made accessible to the public to foster deeper investigation and applications. However, the training datasets for these LLMs, especially the recent state-of-the-art models, are often not fully disclosed. Creating training data for high-performing LLMs involves extensive cleaning and deduplication to ensure the necessary level of quality. The lack of transparency about training data has thus hampered research on attributing and addressing hallucination and bias issues in LLMs, hindering replication efforts and further advancements in the community. These challenges become even more pronounced in multilingual learning scenarios, where the available multilingual text datasets are often inadequately collected and cleaned. Consequently, there is a lack of open-source, readily usable datasets for effectively training LLMs in multiple languages. To overcome this issue, we present CulturaX, a substantial multilingual dataset with 6.3 trillion tokens in 167 languages, tailored for LLM development. Our dataset undergoes meticulous cleaning and deduplication through a rigorous multi-stage pipeline, including language identification, URL-based filtering, metric-based cleaning, document refinement, and data deduplication, to achieve the best quality for model training. CulturaX is fully released to the public on HuggingFace to facilitate research and advancements in multilingual LLMs: https://huggingface.co/datasets/uonlp/CulturaX.
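The five stages named in the abstract can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the CulturaX implementation: the abstract does not specify the language identifier, the URL lists, the metrics, or the deduplication method. Every name and threshold here (`BLOCKED_DOMAINS`, `MIN_WORDS`, `MAX_SYMBOL_RATIO`, `toy_detect_lang`, `clean_corpus`) is hypothetical, the language check is a toy heuristic, and exact hashing of normalized text stands in for whatever deduplication the authors actually use.

```python
import hashlib
from urllib.parse import urlparse

# Hypothetical values, chosen only for illustration (not from the paper).
BLOCKED_DOMAINS = {"spam.example"}
MIN_WORDS = 5
MAX_SYMBOL_RATIO = 0.3


def toy_detect_lang(text):
    """Toy stand-in for a real language identifier.

    Returns 'en' when more than 90% of the letters are ASCII, 'other' when
    they are not, and 'und' when the text has no letters at all.
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return "und"
    ascii_share = sum(c.isascii() for c in letters) / len(letters)
    return "en" if ascii_share > 0.9 else "other"


def url_allowed(url):
    """URL-based filtering: reject blocked domains and their subdomains."""
    host = urlparse(url).hostname or ""
    return not any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def passes_metrics(text):
    """Metric-based cleaning: minimum word count and maximum symbol ratio."""
    if len(text.split()) < MIN_WORDS:
        return False
    symbols = sum(not (c.isalnum() or c.isspace()) for c in text)
    return symbols / len(text) <= MAX_SYMBOL_RATIO


def refine(text):
    """Document refinement: drop lines that contain no letters."""
    kept = [ln.strip() for ln in text.splitlines() if any(c.isalpha() for c in ln)]
    return "\n".join(kept)


def fingerprint(text):
    """Hash of the lowercased, whitespace-collapsed text (exact-match dedup)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def clean_corpus(docs, lang="en", detect_lang=toy_detect_lang):
    """Run the stages in the order listed in the abstract."""
    seen = set()
    cleaned = []
    for doc in docs:
        text = doc["text"]
        if detect_lang(text) != lang:      # 1. language identification
            continue
        if not url_allowed(doc["url"]):    # 2. URL-based filtering
            continue
        if not passes_metrics(text):       # 3. metric-based cleaning
            continue
        text = refine(text)                # 4. document refinement
        if not text:
            continue
        key = fingerprint(text)            # 5. deduplication
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"url": doc["url"], "text": text})
    return cleaned
```

Calling `clean_corpus` on a list of `{"url": ..., "text": ...}` dictionaries returns the surviving documents in their original order. The `detect_lang` parameter is left pluggable so that a real language identifier could replace the toy heuristic.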
