Pre-training Large Language Models (LLMs) requires massive amounts of text data, and model performance typically correlates with the scale and quality of the training datasets. This means that it can be challenging to build LLMs for smaller languages such as the Nordic ones, where the availability of text corpora is limited. To facilitate the development of LLMs for the Nordic languages, we curate a high-quality dataset consisting of 1.2TB of text in all of the major North Germanic languages (Danish, Icelandic, Norwegian, and Swedish), as well as some high-quality English data. This paper details our considerations and processes for collecting, cleaning, and filtering the dataset.