A major driver of AI products today is that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the well-known (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates the cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since, naively viewed, it appears to give competence levels on skills that violate usual generalization theory. (c) A key example of slingshot generalization: competence at executing tasks involving $k$-tuples of skills emerges at essentially the same scaling and the same rate as competence on the elementary skills themselves.