The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on scaling properties yields only an incomplete answer: optimization loss decreases predictably as model size increases, in line with established scaling laws; yet no scaling law for task performance has been established, and task performance remains far from predictable during scaling. Task performance typically shows only minor gains on small models until it improves dramatically once models exceed a size threshold, exemplifying so-called ``emergent abilities''. In this study, we discover that although small models exhibit only minor performance, they demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce PassUntil, an evaluation strategy with theoretically infinite resolution, through massive sampling in the decoding phase. With PassUntil, we conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. First, we identify a strict task scaling law, not previously known to exist, which enhances the predictability of task performance. Remarkably, we are able to predict the performance of a 2.4B model on code generation with merely 0.05\% deviation before training starts, making this the first systematic attempt to verify the predictable scaling proposed by GPT-4's report. Second, we study emergent abilities quantitatively. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by the standard scaling law function and exhibits an increasing growth speed. We then examine two hypotheses and find that the ``multiple circuits hypothesis'' might be responsible for the accelerated emergence.
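The core idea behind PassUntil, as described above, can be sketched as follows: rather than scoring a fixed small number of samples per instance, keep sampling until the first success and use the reciprocal of the number of draws as a pass-rate estimate, so that arbitrarily small pass rates remain measurable. This is a minimal illustrative sketch under assumed interfaces (`evaluate_once` and the toy pass rate are hypothetical), not the authors' implementation.

```python
import random

def pass_until(evaluate_once, max_samples=10**6):
    """Estimate a tiny pass rate by sampling until the first success.

    evaluate_once() should draw one model sample for the task instance
    and return True on success. The number of draws K until the first
    success follows a geometric distribution, so 1/K is a simple
    estimate of the pass rate; averaging over many instances yields
    resolution limited only by the sampling budget.
    """
    for k in range(1, max_samples + 1):
        if evaluate_once():
            return 1.0 / k
    return 0.0  # no success observed within the budget

# Toy usage: a hypothetical "task" whose true pass rate is 0.002,
# far below what a handful of samples per instance could resolve.
random.seed(0)
estimates = [pass_until(lambda: random.random() < 0.002) for _ in range(200)]
avg_pass_rate = sum(estimates) / len(estimates)
```

The key design point is that the sampling budget adapts per instance: easy instances terminate quickly, while hard ones receive enough samples to register a nonzero score instead of rounding to zero.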