Language models famously improve under a smooth scaling law, but some specific capabilities exhibit sudden breakthroughs in performance. Advocates of "emergence" view these capabilities as unlocked at a specific scale, while others attribute breakthroughs to superficial thresholding effects in the evaluation metric. We propose that breakthroughs are instead driven by continuous changes in the probability distribution of training outcomes when performance is bimodally distributed across random seeds. We show that different random seeds can produce either smooth or emergent scaling trends in synthetic length generalization tasks, multiple-choice question answering, and grammatical generalization. We reveal that sharp breakthroughs in metrics are produced by underlying continuous changes in their distribution across seeds. These distributions may become abruptly bimodal at a capacity threshold, but this threshold appears at scales well before most seeds achieve breakthrough. Our observations hold even under continuous loss metrics, confirming that random variation must be considered when predicting a model's performance from its scale.
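To make the proposed mechanism concrete, the following is a minimal simulation sketch (our illustration, not the paper's experimental setup; the logistic form of p(scale) and all parameter values are assumptions chosen for clarity): seed-level accuracy is drawn from a two-component mixture whose success weight rises continuously with scale, yet per-seed outcomes and the across-seed median shift abruptly.

```python
# Minimal sketch (assumed parameters): a continuously increasing breakthrough
# probability p(scale) over a bimodal seed distribution produces a smooth mean
# but an abrupt jump in the median, mimicking "emergent" scaling curves.
import numpy as np

rng = np.random.default_rng(0)
scales = np.logspace(6, 9, 13)   # hypothetical parameter counts
n_seeds = 200

for scale in scales:
    # Continuous change: probability that a given seed "breaks through",
    # modeled here as a logistic function of log10(scale) (an assumption).
    p = 1.0 / (1.0 + np.exp(-3.0 * (np.log10(scale) - 7.5)))
    breakthrough = rng.random(n_seeds) < p
    # Bimodal outcomes across seeds: near-chance failures vs. near-ceiling successes
    acc = np.where(breakthrough,
                   rng.normal(0.95, 0.02, n_seeds),   # success mode
                   rng.normal(0.25, 0.02, n_seeds))   # failure mode (~4-way chance)
    print(f"scale={scale:9.2e}  p={p:.2f}  "
          f"mean={acc.mean():.2f}  median={np.median(acc):.2f}")
```

In this toy setting the mean accuracy tracks p smoothly, while the median snaps from the failure mode to the success mode once p crosses 0.5, and any single seed's curve looks like a sudden breakthrough, even though the underlying distribution changes continuously.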