Recent LLMs have hundreds of billions of parameters and consume vast resources. Furthermore, the so-called "AI scaling law" for transformers suggests that the number of parameters must scale linearly with the size of the data. In response, we inquire into efficient LLMs, i.e., those with the fewest parameters that achieve the desired accuracy on a training corpus. Specifically, by comparing theoretical and empirical estimates of the Kullback-Leibler divergence, we derive a natural AI scaling law: the number of parameters in an efficient LLM scales as $D^{\gamma}$, where $D$ is the size of the training data and $\gamma \in [0.44, 0.72]$, suggesting the existence of more efficient architectures. Against this backdrop, we propose recurrent transformers, which combine the efficacy of transformers with the efficiency of recurrent networks by progressively applying a single transformer layer to a fixed-width sliding window across the input sequence. Recurrent transformers (a) run in time linear in the sequence length, (b) are memory-efficient and support parallel processing in large batches, (c) learn to forget history on language tasks, or to accumulate history on long-range tasks such as copy and selective copy, and (d) are amenable to curriculum training to overcome vanishing gradients. In our experiments, we find that recurrent transformers perform favorably on benchmark tests.
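As a rough illustration of the sliding-window recurrence described above, the following minimal PyTorch sketch applies one shared transformer layer to fixed-width windows of the input, carrying a small summary state from one window to the next; the class name, window size, and state-passing scheme are illustrative assumptions, not the paper's implementation.

import torch
import torch.nn as nn

class RecurrentTransformerSketch(nn.Module):
    """One shared transformer layer swept over fixed-width windows (hypothetical sketch)."""
    def __init__(self, d_model=64, nhead=4, window=16):
        super().__init__()
        self.window = window
        # A single transformer layer reused at every window position (the recurrence).
        self.layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        # Learned initial state carried across windows (assumed mechanism).
        self.state0 = nn.Parameter(torch.zeros(1, 1, d_model))

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        b, t, d = x.shape
        state = self.state0.expand(b, 1, d)
        outputs = []
        for start in range(0, t, self.window):     # one sweep => linear in seq_len
            chunk = x[:, start:start + self.window]
            # Process [carried state ; current window] with the shared layer.
            h = self.layer(torch.cat([state, chunk], dim=1))
            state = h[:, :1]                       # first slot becomes the next carried state
            outputs.append(h[:, 1:])               # remaining slots are the window outputs
        return torch.cat(outputs, dim=1)           # (batch, seq_len, d_model)

model = RecurrentTransformerSketch()
y = model(torch.randn(2, 128, 64))                 # 128 tokens processed in 8 windows
print(y.shape)                                     # torch.Size([2, 128, 64])

Because every window is processed by the same layer with a constant-size carried state, the total cost grows linearly with sequence length, consistent with property (a) above.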