Deep learning models can appear to acquire the ability to solve a new problem suddenly as training time, training data, or model size increases, a phenomenon known as emergence. In this paper, we present a framework in which each new ability (a skill) is represented as a basis function. We solve a simple multi-linear model in this skill basis, obtaining analytic expressions for the emergence of new skills, as well as scaling laws for the loss with respect to training time, data size, model size, and optimal compute ($C$). We compare our detailed calculations to direct simulations of a two-layer neural network trained on multitask sparse parity, where the tasks in the dataset are distributed according to a power law. Using a single fit parameter, our simple model captures the sigmoidal emergence of multiple new skills in the neural network as training time, data size, or model size increases.
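To make the benchmark concrete, the following is a minimal sketch of how a multitask sparse parity dataset with power-law task frequencies might be generated. The abstract does not spell out the construction, so everything here is an assumption for illustration: the input layout (one-hot control bits selecting the task, followed by random data bits), the label rule (parity of a fixed subset of `k` data bits per task), the frequency exponent `-(alpha + 1)`, and the function name `make_multitask_sparse_parity` and its parameters.

```python
import random


def make_multitask_sparse_parity(n_samples, n_tasks=8, n_bits=16, k=3,
                                 alpha=0.5, seed=0):
    """Generate (input, label, task) samples; a sketch under assumed conventions."""
    rng = random.Random(seed)

    # Each task (skill) gets a fixed, randomly chosen subset of k data-bit positions.
    subsets = [sorted(rng.sample(range(n_bits), k)) for _ in range(n_tasks)]

    # Assumed power-law task frequencies: p_i proportional to i^-(alpha + 1), i = 1..n_tasks.
    weights = [(i + 1) ** -(alpha + 1.0) for i in range(n_tasks)]
    total = sum(weights)
    probs = [w / total for w in weights]

    samples = []
    for _ in range(n_samples):
        # Draw the task index according to the power-law distribution.
        task = rng.choices(range(n_tasks), weights=probs, k=1)[0]
        data_bits = [rng.randint(0, 1) for _ in range(n_bits)]
        # One-hot control bits tell the model which task is active.
        control_bits = [1 if t == task else 0 for t in range(n_tasks)]
        # Label is the parity of the task's relevant data bits.
        label = sum(data_bits[j] for j in subsets[task]) % 2
        samples.append((control_bits + data_bits, label, task))
    return samples, subsets, probs


if __name__ == "__main__":
    samples, subsets, probs = make_multitask_sparse_parity(1000)
    counts = [0] * len(probs)
    for _, _, task in samples:
        counts[task] += 1
    # Frequent tasks dominate the data; rare ones appear only occasionally.
    print("task probabilities:", [round(p, 3) for p in probs])
    print("task counts:", counts)
```

Under this setup, frequent tasks contribute many more examples than rare ones, which is one plausible reason skills would be acquired one after another as data or training time grows.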