There is growing evidence that pretraining on high-quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open-source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large-scale training on quantitative web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B-parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.