Hybrid Transformer architectures, which combine softmax attention blocks with recurrent neural networks (RNNs), offer a desirable performance-throughput tradeoff for long-context modeling, but their adoption and study are hindered by the prohibitive cost of large-scale pre-training from scratch. Recent work has shown that pre-trained softmax attention blocks can be converted into RNN blocks through parameter transfer and knowledge distillation. However, these transfer methods require substantial amounts of training data (more than 10B tokens), and the resulting hybrid models also exhibit poor long-context performance, precisely the scenario where hybrid models enjoy significant inference speedups over Transformer-based models. In this paper, we present HALO (Hybrid Attention via Layer Optimization), a pipeline for distilling Transformer models into RNN-attention hybrid models. We then present HypeNet, a hybrid architecture with superior length generalization enabled by a novel position encoding scheme (named HyPE) and various architectural modifications. We convert the Qwen3 series into HypeNet using HALO, achieving performance comparable to the original Transformer models while enjoying superior long-context performance and efficiency. The conversion requires just 2.3B tokens, less than 0.01% of their pre-training data.