Recent research has highlighted the importance of dataset size in scaling language models. However, large language models (LLMs) are notoriously token-hungry during pre-training, and high-quality text data on the web is approaching its scaling limit for LLMs. A straightforward way to further enhance LLMs is to repeat the pre-training data for additional epochs. In this study, we empirically investigate three key aspects of this approach. First, we explore the consequences of repeating pre-training data and find that the model is susceptible to overfitting, which leads to multi-epoch degradation. Second, we examine the key factors contributing to multi-epoch degradation: dataset size, model parameters, and training objectives are significant, while dataset quality and model FLOPs are less influential. Finally, we explore whether widely used regularization techniques can alleviate multi-epoch degradation. Most do not yield significant improvements; the exception is dropout, which is remarkably effective but requires careful tuning as the model size scales up. Additionally, we discover that mixture-of-experts (MoE) models enable cost-effective and efficient hyper-parameter tuning for computationally intensive dense LLMs with comparable trainable parameters, potentially impacting efficient LLM development on a broader scale.
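To make the repeated-data setting concrete, the following is a minimal sketch of how a fixed token budget translates into a number of epochs over a limited dataset, and of how the data can be streamed with a fresh shuffle on every pass. The function names `epochs_needed` and `repeated_stream`, the seed handling, and the token counts in the usage lines are illustrative assumptions, not the configuration used in this study.

```python
import random


def epochs_needed(token_budget: int, unique_tokens: int) -> float:
    """Passes over the dataset required to consume the whole token budget."""
    if unique_tokens <= 0:
        raise ValueError("unique_tokens must be positive")
    return token_budget / unique_tokens


def repeated_stream(dataset, num_epochs, seed=0):
    """Yield (epoch, example) pairs, reshuffling the dataset on every epoch."""
    rng = random.Random(seed)
    for epoch in range(num_epochs):
        order = list(range(len(dataset)))
        rng.shuffle(order)  # new order each pass; the examples themselves repeat
        for index in order:
            yield epoch, dataset[index]


# Hypothetical numbers: a budget of 2**35 tokens over 2**27 unique tokens.
print(epochs_needed(2**35, 2**27))

# Toy dataset repeated for two epochs.
for epoch, example in repeated_stream(["a", "b", "c"], num_epochs=2):
    print(epoch, example)
```

Reshuffling changes only the order of the examples, not which ones the model sees, so every pass beyond the first exposes the model to tokens it has already trained on. This is the regime in which the overfitting described above can arise.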