Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Luca Soldaini, Rodney Kinney, Akshita Bhagia, Dustin Schwenk, David Atkinson, Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar, Valentin Hofmann, Ananya Harsh Jha, Sachin Kumar, Li Lucy, Xinxi Lyu, Nathan Lambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Abhilasha Ravichander, Kyle Richardson, Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh, Luke Zettlemoyer, Noah A. Smith, Hannaneh Hajishirzi, Iz Beltagy, Dirk Groeneveld, Jesse Dodge, Kyle Lo

Source: arXiv. Dataset available at: https://huggingface.co/datasets/allenai/dolma

Language models have become a critical technology for tackling a wide range of natural language processing tasks, yet many details about how the best-performing language models were developed are not reported. In particular, information about their pretraining corpora is seldom discussed: commercial language models rarely provide any information about their data; even open models rarely release the datasets they are trained on, or an exact recipe to reproduce them. As a result, it is challenging to conduct certain threads of language modeling research, such as understanding how training data impacts model capabilities and shapes their limitations. To facilitate open research on language model pretraining, we release Dolma, a three-trillion-token English corpus built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. In addition, we open-source our data curation toolkit to enable further experimentation and reproduction of our work. In this report, we document Dolma, including its design principles, details about its construction, and a summary of its contents. We interleave this report with analyses and experimental results from training language models on intermediate states of Dolma to share what we have learned about important data curation practices, including the role of content or quality filters, deduplication, and multi-source mixing. Dolma has been used to train OLMo, a state-of-the-art, open language model and framework designed to build and study the science of language modeling.
