Despite recent advances in large language models (LLMs), predicting the optimal model size or allocating compute resources for LLM pretraining remains a challenge. Several efforts have addressed this challenge by proposing empirical scaling laws, but almost all of them are architecture-specific (dense or sparse). In this work, we revisit existing empirical scaling laws and propose a generalized scaling law that provides a unified framework applicable to both dense and sparse large language models. We evaluate our generalized law against existing scaling laws and demonstrate that it captures their scaling behavior. Further, we present an IsoFLOP comparison between our scaling law and the state-of-the-art to illustrate its effectiveness for Mixture-of-Experts (MoE)-based very large LLMs such as DeepSeek-V3. Our scaling law can be used to estimate the best model hyperparameters (model size, tokens, and compute) for a given sparsity, or to identify the optimal sparsity for given model hyperparameters.