During the development of large language models (LLMs), pre-training data plays a critical role in shaping their capabilities. In recent years, several large-scale, high-quality pre-training datasets have been released to accelerate LLM research, including ChineseWebText1.0, C4, Pile, WanJuan, MAPCC, and others. However, as LLMs continue to evolve, the focus has increasingly shifted toward domain-specific capabilities and safety concerns, making these earlier coarse-grained corpora insufficient for current training requirements. Moreover, fine-grained information such as quality, domain, and toxicity is becoming increasingly important for building powerful and reliable LLMs across diverse scenarios. To address these challenges, this paper proposes a new tool-chain, MDFG-tool, for constructing large-scale, high-quality Chinese datasets with multi-dimensional, fine-grained annotations. First, we apply manually crafted rules to discard explicitly noisy texts from the raw content. Second, a quality evaluation model, a domain classifier, and a toxicity evaluation model are designed to assess the remaining cleaned data along their respective dimensions. Finally, we attach these three types of fine-grained information to each text. With this approach, we release ChineseWebText2.0, the largest high-quality, fine-grained Chinese text dataset to date. It comprises 3.8 TB of text, where each document is annotated with a quality score, domain labels, a toxicity label, and a toxicity score, enabling LLM researchers to select data according to these fine-grained criteria. The data, code, and tool-chain are available at https://github.com/CASIA-LM/ChineseWebText-2.0
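The per-text annotations described above (quality score, domain labels, toxicity label, toxicity score) lend themselves to simple threshold-based selection. The sketch below shows one way a researcher might filter such records from a JSONL stream; the field names (`text`, `quality_score`, `toxicity_score`) are illustrative assumptions, not the confirmed schema of the released dataset.

```python
import json

def select_records(lines, min_quality=0.9, max_toxicity=0.1):
    """Yield texts whose (assumed) quality/toxicity scores pass the thresholds.

    Each line is expected to be a JSON object; records missing a score
    are treated conservatively (quality 0.0, toxicity 1.0) and dropped.
    """
    for line in lines:
        rec = json.loads(line)
        if (rec.get("quality_score", 0.0) >= min_quality
                and rec.get("toxicity_score", 1.0) <= max_toxicity):
            yield rec["text"]

# Toy usage with two illustrative records: only the first passes both filters.
sample = [
    json.dumps({"text": "clean", "quality_score": 0.95, "toxicity_score": 0.01}),
    json.dumps({"text": "noisy", "quality_score": 0.40, "toxicity_score": 0.02}),
]
print(list(select_records(sample)))  # ['clean']
```

In practice the thresholds would be chosen per use case; a safety-sensitive application might tighten `max_toxicity`, while a domain-adaptation run would additionally filter on the domain labels.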