During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping their capabilities. To accelerate LLM research, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still a lack of a complete tool-chain for extracting clean text from web data. Furthermore, fine-grained information about the corpora, e.g. the quality of each text, is missing. To address these challenges, we propose EvalWeb, a new complete tool-chain for extracting clean Chinese text from noisy web data. First, as in previous work, manually crafted rules are used to discard explicitly noisy texts from the raw crawled web content. Second, a well-designed evaluation model assesses the remaining, relatively clean data and assigns each text a quality score. Finally, an appropriate threshold on this score can be used to select high-quality pre-training data for Chinese. Using the proposed approach, we release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset. It consists of 1.42 TB of data, and each text is associated with a quality score, allowing LLM researchers to select data according to their desired quality threshold. We also release a much cleaner 600 GB subset of Chinese data with quality exceeding 90%.
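The three stages described above (rule-based filtering, quality scoring, threshold selection) can be illustrated with a minimal Python sketch. This is not the EvalWeb implementation: the function names, the rule parameters (`min_chars`, `min_cjk_ratio`) and the toy scorer are all assumptions made for illustration. In particular, `toy_quality_score` is a character-diversity heuristic standing in for the trained evaluation model, so its scores are not comparable to the quality scores released with ChineseWebText.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class ScoredText:
    text: str
    quality: float  # score in [0, 1]; higher means cleaner


def passes_rules(text: str, min_chars: int = 20, min_cjk_ratio: float = 0.3) -> bool:
    """Stage 1 (hypothetical rules): drop very short or mostly non-Chinese texts."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        return False
    # Count characters in the CJK Unified Ideographs block.
    cjk = sum(1 for ch in stripped if "\u4e00" <= ch <= "\u9fff")
    return cjk / len(stripped) >= min_cjk_ratio


def toy_quality_score(text: str) -> float:
    """Stage 2 placeholder: ratio of distinct characters to total characters.

    A real pipeline would call a trained evaluation model here instead.
    """
    stripped = text.strip()
    if not stripped:
        return 0.0
    return len(set(stripped)) / len(stripped)


def score_corpus(
    texts: Iterable[str],
    scorer: Callable[[str], float] = toy_quality_score,
) -> Iterator[ScoredText]:
    """Apply the rules, then attach a quality score to every surviving text."""
    for text in texts:
        if passes_rules(text):
            yield ScoredText(text=text, quality=scorer(text))


def select_by_threshold(scored: Iterable[ScoredText], threshold: float) -> List[ScoredText]:
    """Stage 3: keep only texts whose quality score reaches the threshold."""
    return [item for item in scored if item.quality >= threshold]


if __name__ == "__main__":
    raw = [
        "大语言模型的预训练数据规模与质量对模型能力起着关键作用。",
        "点击点击点击点击点击点击点击点击点击点击点击点击",
        "登录",
        "Click here to subscribe to our newsletter today!",
    ]
    scored = list(score_corpus(raw))
    for item in select_by_threshold(scored, threshold=0.5):
        print(f"{item.quality:.2f}\t{item.text}")
```

Because every surviving text keeps its score, the threshold can be chosen after scoring without rerunning the earlier stages, which mirrors the way a stricter threshold yields a smaller, cleaner subset of the same corpus.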