OpenWebMath: An Open Dataset of High-Quality Mathematical Web Text

There is growing evidence that pretraining on high quality, carefully thought-out tokens such as code or mathematics plays an important role in improving the reasoning abilities of large language models. For example, Minerva, a PaLM model finetuned on billions of tokens of mathematical documents from arXiv and the web, reported dramatically improved performance on problems that require quantitative reasoning. However, because all known open source web datasets employ preprocessing that does not faithfully preserve mathematical notation, the benefits of large scale training on quantitive web documents are unavailable to the research community. We introduce OpenWebMath, an open dataset inspired by these works containing 14.7B tokens of mathematical webpages from Common Crawl. We describe in detail our method for extracting text and LaTeX content and removing boilerplate from HTML documents, as well as our methods for quality filtering and deduplication. Additionally, we run small-scale experiments by training 1.4B parameter language models on OpenWebMath, showing that models trained on 14.7B tokens of our dataset surpass the performance of models trained on over 20x the amount of general language data. We hope that our dataset, openly released on the Hugging Face Hub, will help spur advances in the reasoning abilities of large language models.

翻译：越来越多的证据表明，在代码或数学等高质量、精心设计的标记上进行预训练，对于提升大语言模型的推理能力具有重要作用。例如，Minerva（一个基于PaLM的模型）在arXiv和网络上的数十亿数学文档标记上进行微调后，在需要定量推理的问题上取得了显著提升的性能。然而，由于所有已知的开源网络数据集都采用了无法忠实保留数学符号的预处理方法，研究社区无法获得在大规模定量网络文档上训练的好处。我们提出了OpenWebMath——一个受这些工作启发的开放数据集，包含来自Common Crawl的147亿个数学网页标记。我们详细描述了从HTML文档中提取文本和LaTeX内容、去除模板的方法，以及质量过滤和去重技术。此外，我们通过训练14亿参数的模型在OpenWebMath上进行小规模实验，结果表明，在我们的数据集上训练的模型（使用147亿标记）性能超过了使用20倍以上通用语言数据训练的模型。我们希望这个在Hugging Face Hub上公开的数据集，能推动大语言模型推理能力的进步。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

34+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日