Predicting Emergent Abilities with Infinite Resolution Evaluation

The scientific scale-up of large language models (LLMs) necessitates a comprehensive understanding of their scaling properties. However, the existing literature on scaling properties yields only an incomplete answer: optimization loss decreases predictably as model size increases, in line with established scaling laws; yet no scaling law for task performance has been established, and task performance is far from predictable during scaling. Task performance typically shows minor gains on small models and then improves dramatically once models exceed a size threshold, exemplifying the ``emergent abilities''. In this study, we discover that small models, although their performance is low, demonstrate critical and consistent task performance improvements that are not captured by conventional evaluation strategies due to insufficient measurement resolution. To measure such improvements, we introduce PassUntil, an evaluation strategy with theoretically infinite resolution, achieved through massive sampling in the decoding phase. With PassUntil, we conduct a quantitative investigation into the scaling law of task performance. The investigation contains two parts. First, we identify a strict task scaling law that was not previously known to exist, enhancing the predictability of task performance. Remarkably, we are able to predict the performance of the 2.4B model on code generation with merely 0.05\% deviation before training starts, which is the first systematic attempt to verify the predictable scaling proposed in the GPT-4 report. Second, we are able to study emergent abilities quantitatively. We identify a kind of accelerated emergence whose scaling curve cannot be fitted by the standard scaling law function and grows at an increasing speed. We then examine two hypotheses and suggest that the ``multiple circuits hypothesis'' might be responsible for the accelerated emergence.

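To illustrate why sampling until success gives finer resolution than a fixed sampling budget, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not state the stopping rule or the estimator, so the sketch assumes that sampling stops after `r` passing generations and that the pass probability is estimated as `r / K`, where `K` is the number of samples drawn. The names `pass_until`, `fixed_budget_pass_rate`, and `make_bernoulli_sampler` are hypothetical, and a seeded Bernoulli draw stands in for "decode one answer from the model and check it".

```python
import random


def fixed_budget_pass_rate(sample_passes, n):
    """Conventional evaluation: draw n samples and report the fraction that pass.

    The smallest nonzero value this can report is 1/n, so any pass
    probability well below 1/n is usually measured as exactly 0.
    """
    return sum(1 for _ in range(n) if sample_passes()) / n


def pass_until(sample_passes, r=1, max_samples=10_000_000):
    """Assumed stopping rule: keep sampling until r samples have passed.

    Returns (estimate, samples_used). The estimate r/K is a simple plug-in
    value, not a bias-corrected one. max_samples is a safety cap added for
    this sketch so the loop always terminates.
    """
    successes = 0
    for k in range(1, max_samples + 1):
        if sample_passes():
            successes += 1
            if successes == r:
                return r / k, k
    # Budget exhausted before r passes were observed.
    return successes / max_samples, max_samples


def make_bernoulli_sampler(p, seed=0):
    """Stand-in for a model: each call 'passes' with probability p."""
    rng = random.Random(seed)
    return lambda: rng.random() < p


if __name__ == "__main__":
    true_p = 1e-4  # a weak model that almost never solves the task
    coarse = fixed_budget_pass_rate(make_bernoulli_sampler(true_p), n=100)
    fine, used = pass_until(make_bernoulli_sampler(true_p), r=5)
    print(f"fixed budget (n=100): {coarse}")
    print(f"sample until 5 passes: {fine:.2e} using {used} samples")
```

Under these assumptions, the fixed-budget estimate cannot distinguish a model with pass probability 1e-4 from one with 1e-6 when `n` is 100, whereas the sample-until-pass estimate keeps shrinking as the true probability shrinks, at the cost of drawing more samples.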