Neural scaling laws predict how language model performance improves with increased compute. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation perplexity suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without assuming any bottleneck or functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 2.04% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 38% reduction compared to logistic scaling laws (3.29% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling laws directly from data outperforms parametric alternatives.