A major driver of AI products today is the fact that new skills emerge in language models when their parameter set and training corpora are scaled up. This phenomenon is poorly understood, and a mechanistic explanation via mathematical analysis of gradient-based training seems difficult. The current paper takes a different approach, analysing emergence using the famous (and empirical) Scaling Laws of LLMs and a simple statistical framework. Contributions include: (a) A statistical framework that relates cross-entropy loss of LLMs to competence on the basic skills that underlie language tasks. (b) Mathematical analysis showing that the Scaling Laws imply a strong form of inductive bias that allows the pre-trained model to learn very efficiently. We informally call this {\em slingshot generalization} since, naively viewed, it appears to give competence levels at skills that violate usual generalization theory. (c) A key example of slingshot generalization: competence at executing tasks involving $k$-tuples of skills emerges at essentially the same scaling and the same rate as competence on the elementary skills themselves.
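The abstract does not state which form of the Scaling Laws the analysis uses. As a minimal sketch, assuming the widely cited Chinchilla-style parametric form $L(N, D) = E + A/N^{\alpha} + B/D^{\beta}$, the snippet below shows how the predicted cross-entropy loss falls as parameter count $N$ and training tokens $D$ grow together. The function names, the constants (taken from the published Chinchilla fit, not from this abstract), and the 20-tokens-per-parameter ratio are all assumptions made for illustration.

```python
# Illustrative sketch only: a Chinchilla-style scaling law.
# The constants below are the fitted values reported for Chinchilla;
# they are NOT taken from the abstract and serve only as an example.

def scaling_law_loss(n_params, n_tokens,
                     E=1.69, A=406.4, B=410.7, alpha=0.34, beta=0.28):
    """Predicted cross-entropy loss L(N, D) = E + A / N**alpha + B / D**beta."""
    return E + A / n_params ** alpha + B / n_tokens ** beta


def excess_loss(n_params, n_tokens, E=1.69, **kwargs):
    """Reducible part of the loss: the prediction minus the irreducible term E."""
    return scaling_law_loss(n_params, n_tokens, E=E, **kwargs) - E


if __name__ == "__main__":
    # Hypothetical compute schedule: 20 training tokens per parameter.
    for n_params in (1e9, 1e10, 1e11):
        n_tokens = 20 * n_params
        print(f"N={n_params:.0e}  D={n_tokens:.0e}  "
              f"loss={scaling_law_loss(n_params, n_tokens):.3f}  "
              f"excess={excess_loss(n_params, n_tokens):.3f}")
```

Under this assumed form, it is the reducible (excess) part of the loss that shrinks with scale, and a quantity of that kind is what a statistical framework would relate to competence on underlying skills.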