SlimPajama-DC: Understanding Data Combinations for LLM Training

This paper aims to understand the impact of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset that has been refined and further deduplicated to 627B tokens from the 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices for using SlimPajama to train large language models. Two pivotal observations emerged during this research: (1) Global deduplication vs. local deduplication. We analyze and discuss how global deduplication (across different dataset sources) and local deduplication (within a single dataset source) affect the performance of trained models. (2) Proportions of high-quality, highly deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train a separate 1.3B Cerebras-GPT model with ALiBi and SwiGLU on each. Our best configuration outperforms the 1.3B model trained on RedPajama with the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our findings (for example, that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
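
To make the distinction between local and global deduplication concrete, the following is a minimal sketch on a toy corpus. It is not the SlimPajama pipeline: it assumes exact matching on a normalized hash as a stand-in for whatever near-duplicate detection the dataset actually uses, and the names `fingerprint`, `local_dedup`, `global_dedup`, and `toy_corpus` are hypothetical, introduced only for illustration.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text (exact-match stand-in)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def _keep_unseen(docs, seen):
    """Return docs whose fingerprint is not yet in `seen`, updating `seen`."""
    kept = []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            kept.append(doc)
    return kept


def local_dedup(corpus):
    """Deduplicate within each source; a fresh `seen` set per source."""
    return {source: _keep_unseen(docs, set()) for source, docs in corpus.items()}


def global_dedup(corpus):
    """Deduplicate across all sources; one shared `seen` set.

    The first occurrence wins, so the result depends on source order.
    """
    seen = set()
    return {source: _keep_unseen(docs, seen) for source, docs in corpus.items()}


# Hypothetical toy corpus: "Hello world" appears in both sources.
toy_corpus = {
    "web": ["The cat sat.", "the  cat sat.", "Hello world"],
    "wiki": ["Hello world", "Paris is in France."],
}

local_result = local_dedup(toy_corpus)
global_result = global_dedup(toy_corpus)

local_total = sum(len(docs) for docs in local_result.values())
global_total = sum(len(docs) for docs in global_result.values())
print("local:", local_total, "documents kept")
print("global:", global_total, "documents kept")
```

In this toy case, local deduplication leaves the cross-source copy of "Hello world" in place, while global deduplication removes it. Global deduplication therefore shrinks overlapping sources further and shifts the effective mixture proportions, which is one reason the choice of data combination matters after it is applied.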

