Pre-training large language models has become increasingly expensive, and most practitioners rely on scaling laws to split a compute budget between model size and training tokens, an allocation commonly referred to as compute-optimal or Chinchilla-optimal. In this paper, we hypothesize a new scaling law under which the performance of transformer-based models depends mostly on the total compute spent, independent of how that compute is allocated between model size and dataset size. Using this unified scaling law, we predict that (a) for inference efficiency, training should prioritize smaller models and larger training datasets, and (b) once available web datasets are exhausted, scaling the model size might be the only way to further improve model performance.
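To make prediction (a) concrete, the following minimal sketch enumerates ways of spending one fixed training budget. It is an illustration and not the paper's method: it assumes the widely used approximations of about 6 FLOPs per parameter per training token (so C ≈ 6·N·D) and about 2 FLOPs per parameter per generated token at inference, neither of which is stated in the abstract. The function names and the example budget are hypothetical.

```python
# Illustrative sketch only. Assumptions (not from the abstract):
#   training compute  C ~= 6 * N * D  FLOPs
#   inference cost    ~= 2 * N        FLOPs per generated token
# where N is the parameter count and D is the number of training tokens.


def tokens_for_budget(compute_flops: float, n_params: float) -> float:
    """Training tokens D affordable for a model of size N at a fixed budget C."""
    return compute_flops / (6.0 * n_params)


def inference_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass cost per generated token."""
    return 2.0 * n_params


def allocations(compute_flops: float, model_sizes: list[float]) -> list[tuple[float, float, float]]:
    """Return (N, D, inference FLOPs per token) for each candidate model size."""
    return [
        (n, tokens_for_budget(compute_flops, n), inference_flops_per_token(n))
        for n in model_sizes
    ]


if __name__ == "__main__":
    budget = 6e21  # hypothetical training budget in FLOPs
    for n, d, cost in allocations(budget, [1e9, 1e10, 1e11]):
        print(f"N={n:.0e} params  D={d:.2e} tokens  inference={cost:.0e} FLOPs/token")
```

If performance really depends mostly on C, every row in this table would reach roughly the same quality, and the row with the smallest N would be the cheapest to serve. That is the reasoning behind preferring smaller models trained on more tokens.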