Pre-training Large Language Models (LLMs) on high-quality, meticulously curated datasets is widely recognized as critical for enhancing their performance and generalization capabilities. This study explores the untapped potential of Common Crawl as a comprehensive and flexible resource for pre-training LLMs, addressing both general-purpose language understanding and specialized domain knowledge. We introduce RedStone, an innovative and scalable pipeline engineered to extract and process data from Common Crawl, facilitating the creation of extensive and varied pre-training datasets. Unlike traditional datasets, which often require expensive curation and domain-specific expertise, RedStone leverages the breadth of Common Crawl to deliver datasets tailored to a wide array of domains. In this work, we exemplify its capability by constructing pre-training datasets across multiple fields, including general language understanding, code, mathematics, and question-answering tasks. The flexibility of RedStone allows for easy adaptation to other specialized domains, significantly lowering the barrier to creating valuable domain-specific datasets. Our findings demonstrate that Common Crawl, when harnessed through effective pipelines like RedStone, can serve as a rich, renewable source of pre-training data, unlocking new avenues for domain adaptation and knowledge discovery in LLMs. This work also underscores the importance of innovative data acquisition strategies and highlights the role of web-scale data as a powerful resource in the continued evolution of LLMs. RedStone code and data samples will be publicly available at \url{https://aka.ms/redstone}.