Hybrid linear attention models offer an appealing path to faster long-context inference: they reduce the quadratic cost and KV-cache burden of full softmax attention while retaining much of the quality of Transformer models. A practical way to obtain such models is to convert a pretrained Transformer instead of pretraining a new architecture from scratch, but this conversion is still brittle. Simply copying the teacher attention projections into a Gated DeltaNet (GDN) student does not specify the new recurrent decay, write, and output-gating dynamics. As a result, the converted model often starts in a poor dynamical regime and must spend many distillation tokens repairing initialization rather than learning the remaining teacher behavior. We propose Taylor-Calibrate, a lightweight initialization method for hybrid GDN students. The method uses Taylor-guided teacher attention statistics to set the value projection, memory timescale, write gates, and output gate, then applies a short per-layer alignment step to match each converted layer to the teacher output. Across four teacher settings and three retained-layer policies, Taylor-Calibrate gives substantially stronger zero-shot students, with up to an 88x improvement in a representative ablation, and reaches matched recovery targets with 4.9x--9.2x fewer training tokens than naive conversion.
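To make concrete which quantities a plain projection copy leaves unspecified, here is a minimal sketch of one step of a simplified gated delta rule recurrence, in the form commonly written for Gated DeltaNet. It is an illustration under stated assumptions, not the paper's implementation. The names `gdn_step` and `decay_to_timescale` are hypothetical, the gates are passed in as precomputed numbers rather than produced by learned projections, and the elementwise output gate stands in for whatever gating and normalization a real GDN layer uses. The Taylor-guided statistics themselves are not reproduced here.

```python
import numpy as np

def gdn_step(S, q, k, v, alpha, beta, gate):
    """One step of a simplified gated delta rule (illustrative assumption).

    S     : (d_v, d_k) recurrent memory state
    q, k  : (d_k,) query and key; k is assumed to have unit norm
    v     : (d_v,) value
    alpha : scalar in (0, 1], recurrent decay (sets the memory timescale)
    beta  : scalar in [0, 1], write strength
    gate  : (d_v,) output gate with entries in [0, 1]
    """
    d_k = k.shape[0]
    # Decay the old state, erase what was stored along k, then write v along k.
    S_new = alpha * S @ (np.eye(d_k) - beta * np.outer(k, k)) + beta * np.outer(v, k)
    # Read with the query and apply the elementwise output gate.
    o = gate * (S_new @ q)
    return S_new, o

def decay_to_timescale(alpha):
    """Number of steps after which a constant decay alpha shrinks a memory to 1/e."""
    return -1.0 / np.log(alpha)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_k, d_v = 4, 3
    k = rng.normal(size=d_k)
    k /= np.linalg.norm(k)          # unit-norm key, as assumed above
    v = rng.normal(size=d_v)
    S = np.zeros((d_v, d_k))
    S, o = gdn_step(S, q=k, k=k, v=v, alpha=0.9, beta=1.0, gate=np.ones(d_v))
    print("readout:", o)
    print("timescale for alpha=0.9:", decay_to_timescale(0.9))
```

In this sketch, `alpha`, `beta`, and `gate` have no counterpart in softmax attention, so copying the teacher's query, key, and value projections leaves them undetermined. These are the dynamics that the abstract says an initialization method must set.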