Transformers' quadratic computational complexity limits their scalability despite remarkable performance. While linear attention reduces this to linear complexity, pre-training such models from scratch remains, in most cases, prohibitively expensive. Recent post-training linearisation methods convert pre-trained Transformers to linear models efficiently, often using hybrid approaches that combine linear attention with sliding-window softmax attention (SWA). We identify a critical flaw: existing hybrid methods inadvertently bypass the linear component, relying almost entirely on SWA. Component-level diagnostics reveal that this previously undetected behaviour stems from overlooked evaluation practices on common-sense benchmarks. We propose three solutions to ensure balanced component usage: (i) inference-time hybridisation of linear-only conversions with sliding-window softmax; (ii) HedgeCATs, which combines attention-weight transfer with targeted LoRA fine-tuning; and (iii) Scheduled Sliding-window Dropout (SSD), which stochastically suppresses the softmax branch during training to prevent component collapse. Our methods maintain computational efficiency while recovering most of the base model's performance and ensuring genuine linear attention adoption, restoring the validity of performance attributions in hybrid conversions.
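The SSD idea described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the linear ramp schedule, the drop probability `p_max`, and the additive combination of the two branches are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def ssd_schedule(step, total_steps, p_max=0.9):
    # Hypothetical linear ramp: the probability of dropping the
    # softmax branch grows as training progresses.
    return p_max * min(step / total_steps, 1.0)

def hybrid_attention(linear_out, swa_out, step, total_steps, training=True):
    """Combine the linear-attention and sliding-window-softmax outputs.

    During training, the SWA branch is stochastically suppressed
    (Scheduled Sliding-window Dropout), so the model cannot collapse
    onto the softmax path and must learn to use the linear component.
    """
    if training and rng.random() < ssd_schedule(step, total_steps):
        return linear_out  # softmax branch dropped this step
    return linear_out + swa_out  # assumed additive combination
```

At step 0 the drop probability is zero, so both branches always contribute; late in training the softmax branch is suppressed on most steps, forcing genuine use of the linear path.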