Large language model pre-training has become increasingly expensive, and most practitioners rely on scaling laws to allocate a compute budget between model size and training tokens, an allocation commonly referred to as compute-optimal or Chinchilla-optimal. In this paper, we hypothesize a new scaling law for transformer-based models: model performance depends primarily on the total compute spent, largely independent of how that compute is split between model size and dataset size. Using this unified scaling law, we predict that (a) for inference efficiency, training should prioritize smaller models and larger training datasets, and (b) once the available web data is exhausted, scaling up model size may be the only remaining way to further improve model performance.
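As a rough illustration of the allocation trade-off discussed above, the sketch below compares a Chinchilla-style split with a smaller-model, more-tokens split for the same compute budget. It relies on the common approximation C ≈ 6·N·D (compute in FLOPs, N parameters, D training tokens) and a ~20 tokens-per-parameter heuristic for the Chinchilla-style point; the budget and the 4x model-size reduction are illustrative assumptions, not values from the paper.

```python
# Minimal sketch (illustrative assumptions only) of compute allocation
# under the approximation C ≈ 6 * N * D.

def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens affordable for a given compute budget and model size,
    assuming C ≈ 6 * N * D."""
    return compute_flops / (6.0 * n_params)

C = 1e23  # illustrative compute budget in FLOPs

# Chinchilla-style allocation: roughly 20 training tokens per parameter,
# so N = sqrt(C / (6 * 20)) and D = 20 * N.
n_chinchilla = (C / (6.0 * 20.0)) ** 0.5
d_chinchilla = tokens_for_budget(C, n_chinchilla)

# Inference-efficient allocation suggested by the unified-law hypothesis:
# a smaller model trained on correspondingly more tokens, same total compute.
n_small = n_chinchilla / 4.0
d_small = tokens_for_budget(C, n_small)

print(f"Chinchilla-style: N = {n_chinchilla:.2e} params, D = {d_chinchilla:.2e} tokens")
print(f"Smaller model   : N = {n_small:.2e} params, D = {d_small:.2e} tokens")
```

Under the unified-law hypothesis, both allocations would reach comparable loss for the same total compute, but the smaller model is cheaper to serve at inference time.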