State-of-the-art LLMs are powered by scaling -- scaling model size, dataset size and cluster size. It is economically infeasible to extensively tune hyperparameters for the largest runs. Instead, approximately optimal hyperparameters must be inferred or \textit{transferred} from smaller experiments. Hyperparameter transfer across model sizes has been studied in Yang et al. However, hyperparameter transfer across dataset size -- or token horizon -- has not been studied yet. To remedy this, we conduct a large-scale empirical study of how the optimal learning rate (LR) depends on the token horizon in LLM training. We first demonstrate that the optimal LR changes significantly with token horizon -- longer training necessitates a smaller LR. Second, we demonstrate that the optimal LR follows a scaling law, and that the optimal LR for longer horizons can be accurately estimated from shorter horizons via such scaling laws. We also provide a rule-of-thumb for transferring LR across token horizons with zero overhead over current practices. Lastly, we provide evidence that LLaMA-1 used a learning rate that was too high, and estimate the resulting performance hit. We thus argue that hyperparameter transfer across data size is an important and overlooked component of LLM training.
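To make the extrapolation idea concrete, the following is a minimal sketch of fitting a power-law scaling of optimal LR against token horizon and extrapolating it to a longer horizon. The horizon values, LR values, and function names are illustrative assumptions, not results or code from the paper.

\begin{verbatim}
import numpy as np

# Hypothetical measurements: token horizons (billions of tokens) and the
# optimal LR found by a sweep at each short horizon. Placeholder numbers.
horizons = np.array([1.0, 2.0, 4.0, 8.0])
optimal_lr = np.array([6e-4, 4.5e-4, 3.4e-4, 2.5e-4])

# Fit lr*(D) = c * D**slope by linear regression in log-log space.
slope, log_c = np.polyfit(np.log(horizons), np.log(optimal_lr), deg=1)
c = np.exp(log_c)
# slope is negative: longer horizons call for smaller learning rates.

def predict_optimal_lr(horizon_billion_tokens):
    """Extrapolate the fitted scaling law to a longer token horizon."""
    return c * horizon_billion_tokens ** slope

print(f"fitted exponent: {slope:.3f}")
print(f"predicted optimal LR at 300B tokens: {predict_optimal_lr(300.0):.2e}")
\end{verbatim}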