During the development of large language models (LLMs), the scale and quality of the pre-training data play a crucial role in shaping the models' capabilities. To accelerate LLM research, several large-scale datasets, such as C4 [1], Pile [2], RefinedWeb [3] and WanJuan [4], have been released to the public. However, most of the released corpora focus mainly on English, and there is still no complete tool-chain for extracting clean text from web data. Furthermore, fine-grained information about the corpora, e.g. the quality of each text, is missing. To address these challenges, we propose EvalWeb, a new complete tool-chain for extracting clean Chinese text from noisy web data. First, similar to previous work, manually crafted rules are employed to discard explicitly noisy texts from the raw crawled web content. Second, a well-designed evaluation model is used to assess the remaining, relatively clean data, and each text is assigned a specific quality score. Finally, an appropriate threshold on this score can be used to select high-quality pre-training data for Chinese. Using our proposed approach, we release ChineseWebText, the largest and latest large-scale high-quality Chinese web text dataset. It comprises 1.42 TB of data, and each text is associated with a quality score, allowing LLM researchers to choose data according to their desired quality threshold. We also release a much cleaner subset of 600 GB of Chinese data with quality exceeding 90%.
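The three stages described above (rule-based filtering, quality scoring, threshold selection) can be illustrated with a minimal sketch. This is not the EvalWeb implementation: the function names, the length and Chinese-character-ratio rule, the record layout, and the assumption that scores lie in [0, 1] are all hypothetical, and the `scorer` argument stands in for whatever evaluation model is actually used.

```python
from typing import Callable, Dict, Iterable, Iterator


def passes_rules(text: str, min_chars: int = 50, min_chinese_ratio: float = 0.5) -> bool:
    """Hypothetical hand-written rule: keep texts that are long enough and mostly Chinese."""
    if len(text) < min_chars:
        return False
    # Count characters in the basic CJK Unified Ideographs block.
    chinese = sum(1 for ch in text if "\u4e00" <= ch <= "\u9fff")
    return chinese / len(text) >= min_chinese_ratio


def annotate(texts: Iterable[str], scorer: Callable[[str], float]) -> Iterator[Dict]:
    """Stages 1 and 2: drop texts that fail the rules, then attach a quality score."""
    for text in texts:
        if passes_rules(text):
            yield {"text": text, "score": scorer(text)}


def select(records: Iterable[Dict], threshold: float) -> Iterator[Dict]:
    """Stage 3: keep only records whose score exceeds the chosen threshold."""
    for record in records:
        if record["score"] > threshold:
            yield record
```

Keeping the score on every record, rather than filtering once at a fixed cut-off, is what lets a downstream user pick their own threshold: for example, `select(annotate(texts, scorer), threshold=0.9)` would correspond to a stricter subset, under the assumption that scores are scaled to [0, 1].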