Interest in linear complexity models for large language models is growing, although their scaling capacity remains uncertain. In this study, we present scaling laws for linear complexity language models to establish a foundation for their scalability. Specifically, we examine the scaling behavior of three efficient linear architectures: TNL, a linear attention model with data-independent decay; HGRN2, a linear RNN with data-dependent decay; and cosFormer2, a linear attention model without decay. We also include LLaMA as a softmax attention baseline for comparison. Each architecture was trained in six sizes, ranging from 70M to 7B parameters, on a 300B-token corpus, and evaluated using a total of 1,376 intermediate checkpoints. The evaluation covers validation loss, commonsense reasoning, and information retrieval and generation. The study reveals that existing linear complexity language models exhibit scaling capabilities similar to those of conventional transformer-based models while also demonstrating superior linguistic proficiency and knowledge retention.
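To make the three decay regimes named in the abstract concrete, the following is a minimal sketch of causal linear attention written as a recurrence over a fixed-size state. It is an illustration under simplifying assumptions, not the implementation of TNL, HGRN2, or cosFormer2: the function name `linear_attention_recurrent`, the scalar per-step decay, and the absence of normalization, gating, and feature maps are all choices made here for brevity. Passing no decays corresponds to the no-decay case, a constant value to data-independent decay, and values computed from the input to data-dependent decay.

```python
from typing import List, Optional, Sequence

Vector = List[float]


def linear_attention_recurrent(
    queries: Sequence[Vector],
    keys: Sequence[Vector],
    values: Sequence[Vector],
    decays: Optional[Sequence[float]] = None,
) -> List[Vector]:
    """Causal linear attention as a recurrence over a d_k x d_v state.

    Hypothetical helper for illustration only. decays[t] scales the
    previous state at step t; None means no decay (1.0 at every step).
    """
    d_k = len(keys[0])
    d_v = len(values[0])
    # The state has a fixed size, so each step costs O(d_k * d_v)
    # regardless of how many tokens came before.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs: List[Vector] = []
    for t, (q, k, v) in enumerate(zip(queries, keys, values)):
        lam = 1.0 if decays is None else decays[t]
        # state <- lam * state + outer(k, v)
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = lam * state[i][j] + k[i] * v[j]
        # output_t = q @ state
        outputs.append(
            [sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)]
        )
    return outputs


if __name__ == "__main__":
    q = [[1.0], [1.0]]
    k = [[1.0], [1.0]]
    v = [[2.0], [4.0]]
    print(linear_attention_recurrent(q, k, v))              # no decay
    print(linear_attention_recurrent(q, k, v, [0.5, 0.5]))  # constant decay
```

Because the state does not grow with the sequence, total cost is linear in sequence length, in contrast to the quadratic cost of softmax attention over all token pairs.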