Recent studies have called into question the belief that emergent abilities in language models are exclusive to large models. This skepticism arises from two observations: 1) smaller models can also achieve high performance on emergent abilities, and 2) the discontinuous metrics used to measure these abilities have been questioned. In this paper, we propose to study emergent abilities through the lens of pre-training loss, instead of model size or training compute. We demonstrate that models with the same pre-training loss, but different model and data sizes, achieve the same performance on various downstream tasks. We also discover that a model exhibits emergent abilities on certain tasks, regardless of the continuity of metrics, when its pre-training loss falls below a specific threshold. Before reaching this threshold, its performance remains at the level of random guessing. This motivates us to redefine emergent abilities as those that manifest in models with lower pre-training losses, highlighting that these abilities cannot be predicted by merely extrapolating the performance trends of models with higher pre-training losses.