Deep learning models can exhibit what appears to be a sudden ability to solve a new problem as training time ($T$), training data ($D$), or model size ($N$) increases, a phenomenon known as emergence. In this paper, we present a framework in which each new ability (a skill) is represented as a basis function. We solve a simple multilinear model in this skill basis, finding analytic expressions for the emergence of new skills, as well as for scaling laws of the loss with training time, data size, model size, and optimal compute ($C$). We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power law. Using a single fit parameter, our simple model captures the sigmoidal emergence of multiple new skills in the neural network as training time, data size, or model size increases.
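To make the experimental setup concrete, the following is a minimal sketch of how a multitask sparse parity dataset with power-law task frequencies might be generated. The abstract does not specify the construction, so everything here is an assumption for illustration: the function name `make_multitask_sparse_parity`, the one-hot control bits that identify the task, the fixed random subset of `k` bits per task, the frequency law $P(i) \propto i^{-(\alpha+1)}$, and all default values.

```python
import numpy as np


def make_multitask_sparse_parity(n_samples, n_tasks=5, n_bits=16, k=3,
                                 alpha=0.5, seed=0):
    """Hypothetical generator; all parameter choices are illustrative."""
    rng = np.random.default_rng(seed)

    # Assumed power-law task frequencies: P(task i) proportional to i^-(alpha+1).
    ranks = np.arange(1, n_tasks + 1, dtype=float)
    probs = ranks ** -(alpha + 1.0)
    probs /= probs.sum()

    # Each task (skill) reads a fixed random subset of k distinct bit positions.
    subsets = np.stack(
        [rng.choice(n_bits, size=k, replace=False) for _ in range(n_tasks)]
    )

    # Sample a task index per example, then uniformly random task bits.
    tasks = rng.choice(n_tasks, size=n_samples, p=probs)
    bits = rng.integers(0, 2, size=(n_samples, n_bits))

    # Input = one-hot control bits (which task) followed by the task bits.
    control = np.eye(n_tasks, dtype=np.int64)[tasks]
    x = np.concatenate([control, bits], axis=1)

    # Label = parity of the k bits belonging to the sampled task.
    y = np.take_along_axis(bits, subsets[tasks], axis=1).sum(axis=1) % 2
    return x, y, tasks, subsets, probs
```

Under these assumptions, rarer tasks contribute fewer examples, so a network trained on such data would be expected to see enough signal for frequent skills before rare ones, which is the kind of staggered acquisition the abstract describes.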