Trained LLMs are typically sparse in that most of their parameters are zero, which raises questions about efficiency. In response, we study efficient LLMs, i.e., those with the fewest parameters that achieve a desired accuracy on a training corpus. Specifically, we compare theoretical and empirical estimates of the training loss at current scale to obtain upper and lower bounds on the number of unique sequences in a natural training corpus as a function of its size. Our result implies that (1) to double the number of skills represented in a training corpus, the corpus must grow roughly three- to fivefold; (2) for efficient LLMs, the number of parameters $N$ and the size $D$ of a natural training corpus scale as $N \sim D^{0.58}$; and (3) if the number of parameters of an LLM is smaller than the number of unique sequences in the training corpus, scaling up can uncover emergent skills.
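To make the scaling statements concrete, the following is a minimal sketch of the arithmetic they imply. The exponent 0.58 and the three- to fivefold range come from the abstract. The function names are hypothetical, and the power-law form for skills (skills proportional to $D^{\alpha}$) is an assumption introduced only for illustration, not a claim made in the abstract. Because the abstract gives no proportionality constant, only ratios are computed.

```python
import math

# Exponent taken from the abstract: N ~ D^0.58 for efficient LLMs.
PARAM_EXPONENT = 0.58


def param_growth(data_growth: float, exponent: float = PARAM_EXPONENT) -> float:
    """Factor by which N grows when D grows by `data_growth`, if N is proportional to D**exponent."""
    return data_growth ** exponent


def implied_skill_exponent(corpus_factor: float) -> float:
    """ASSUMPTION (not from the abstract): skills are proportional to D**alpha.

    Returns the alpha for which growing the corpus by `corpus_factor`
    exactly doubles the number of skills.
    """
    return math.log(2.0) / math.log(corpus_factor)


if __name__ == "__main__":
    # Doubling the corpus grows an efficient model's parameter count by about 1.49x.
    print(f"N growth for 2x data:  {param_growth(2.0):.3f}")
    # A tenfold corpus grows it by about 3.8x.
    print(f"N growth for 10x data: {param_growth(10.0):.3f}")
    # Under the assumed power law, the three- to fivefold range
    # corresponds to alpha between about 0.43 and 0.63.
    print(f"alpha at 5x: {implied_skill_exponent(5.0):.3f}")
    print(f"alpha at 3x: {implied_skill_exponent(3.0):.3f}")
```

Under this reading, parameter count grows sublinearly with data: a tenfold larger corpus calls for a model only about 3.8 times larger.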