This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset that has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplication affect the performance of trained models. (2) Proportions of high-quality/highly deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train on each one individually using a 1.3B Cerebras-GPT model with ALiBi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16$\times$ CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as the finding that increasing data diversity is crucial after global deduplication) to a 7B model with large-batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
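The distinction between local and global deduplication can be illustrated with a minimal sketch. It is not the pipeline used to build SlimPajama: it assumes exact matching on a hash of whitespace-normalized, lowercased text, and the function names (`local_dedup`, `global_dedup`) and the toy corpus are hypothetical, introduced only for illustration.

```python
import hashlib
from typing import Dict, List


def _fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text (an exact-match stand-in
    for whatever similarity test a real deduplication pipeline would use)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def local_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove duplicates within each source independently."""
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        seen = set()  # reset for every source: no cross-source comparison
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove duplicates across all sources combined.

    The first source (in dict order) that contains a document keeps it,
    so the result depends on the order in which sources are visited.
    """
    seen = set()  # shared by all sources
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


if __name__ == "__main__":
    # Hypothetical toy corpus: "Hello world" appears in both sources.
    toy = {
        "web": ["The cat sat.", "the  cat sat.", "Hello world"],
        "wiki": ["Hello world", "Paris is in France."],
    }
    for label, fn in (("local", local_dedup), ("global", global_dedup)):
        deduped = fn(toy)
        total = sum(len(docs) for docs in deduped.values())
        print(label, total, deduped)
```

In this toy corpus, local deduplication leaves the cross-source copy of "Hello world" in place, while global deduplication removes it. Global deduplication therefore changes the effective proportion of each source in the final mixture.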