Large language models are increasingly becoming a cornerstone technology in artificial intelligence, the sciences, and society as a whole, yet the optimal strategies for dataset composition and filtering remain largely elusive. Many of the top-performing models lack transparency in their dataset curation and model development processes, posing an obstacle to the development of fully open language models. In this paper, we identify three core data-related challenges that must be addressed to advance open-source language models: (1) transparency in model development, including the data curation process; (2) access to large quantities of high-quality data; and (3) availability of artifacts and metadata for dataset curation and analysis. To address these challenges, we release RedPajama-V1, an open reproduction of the LLaMA training dataset. In addition, we release RedPajama-V2, a massive web-only dataset consisting of raw, unfiltered text data together with quality signals and metadata. Together, the RedPajama datasets comprise over 100 trillion tokens spanning multiple domains, and their quality signals facilitate data filtering, aiming to inspire the development of numerous new datasets. To date, these datasets have already been used to train strong language models deployed in production, such as Snowflake Arctic, Salesforce's XGen, and AI2's OLMo. To provide insight into the quality of RedPajama, we present a series of analyses and ablation studies with decoder-only language models of up to 1.6B parameters. Our findings demonstrate how quality signals for web data can be effectively leveraged to curate high-quality subsets of the dataset, underscoring the potential of RedPajama to advance the development of transparent and high-performing language models at scale.
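The quality-signal-based curation described above can be illustrated with a minimal sketch: documents carry precomputed quality signals, and a subset is selected by thresholding them. The signal names (`word_count`, `fraction_repeated_words`) and thresholds here are hypothetical placeholders, not RedPajama-V2's actual signal schema or recommended cutoffs.

```python
# Minimal sketch of filtering web documents by precomputed quality signals.
# Signal names and thresholds are illustrative assumptions, not the
# actual RedPajama-V2 schema.

def passes_filter(doc: dict, min_words: int = 50,
                  max_repeated_fraction: float = 0.2) -> bool:
    """Keep a document only if its quality signals clear simple thresholds."""
    signals = doc["quality_signals"]
    if signals["word_count"] < min_words:
        return False  # too short to be useful training text
    if signals["fraction_repeated_words"] > max_repeated_fraction:
        return False  # likely boilerplate or spam
    return True

# Toy corpus: one document passes, one is filtered out for being too short.
corpus = [
    {"text": "a long, substantive web page ...",
     "quality_signals": {"word_count": 120, "fraction_repeated_words": 0.05}},
    {"text": "short stub ...",
     "quality_signals": {"word_count": 10, "fraction_repeated_words": 0.01}},
]
kept = [d for d in corpus if passes_filter(d)]
```

Because the raw text and signals are released together, different thresholds (or learned classifiers over the signals) yield different curated subsets from the same underlying corpus.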