Recent work has demonstrated the potential of non-transformer language models, especially linear recurrent neural networks (RNNs) and hybrid models that mix recurrence and attention. Yet there is no consensus on whether the potential benefits of these new architectures justify the risk and effort of scaling them up. To address this, we provide evidence for the advantages of hybrid models over pure transformers on several fronts. First, theoretically, we show that hybrid models do not merely inherit the expressivity of transformers and linear RNNs, but can express tasks beyond both, such as code execution. Putting this theory into practice, we train Olmo Hybrid, a 7B-parameter model largely comparable to Olmo 3 7B but with the sliding-window layers replaced by Gated DeltaNet layers. We show that Olmo Hybrid outperforms Olmo 3 across standard pretraining and mid-training evaluations, demonstrating the benefit of hybrid models in a controlled, large-scale setting. We find that the hybrid model scales significantly more efficiently than the transformer, explaining its higher performance. However, it is unclear why greater expressivity on specific formal problems should result in better scaling or superior performance on downstream tasks unrelated to those problems. To explain this apparent gap, we return to theory and argue why increased expressivity should translate to better scaling efficiency, completing the loop. Overall, our results suggest that hybrid models mixing attention and recurrent layers are a powerful extension to the language modeling paradigm: not merely a way to reduce memory during inference, but a fundamental way to obtain more expressive models that scale better during pretraining.