Linear Transformers have gained attention as efficient alternatives to standard Transformers, but their performance in retrieval and long-context tasks has been limited. To address these limitations, recent work has explored two distinct mechanisms: gating for adaptive memory control and the delta update rule for precise memory modifications. We observe that these mechanisms are complementary: gating enables rapid memory erasure while the delta rule facilitates targeted updates. Building on this insight, we introduce the gated delta rule and develop a parallel training algorithm optimized for modern hardware. Our proposed architecture, Gated DeltaNet, consistently surpasses existing models like Mamba2 and DeltaNet across multiple benchmarks, including language modeling, common-sense reasoning, in-context retrieval, length extrapolation, and long-context understanding. We further enhance performance by developing hybrid architectures that combine Gated DeltaNet layers with sliding window attention or Mamba2 layers, achieving both improved training efficiency and superior task performance.
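To make the interplay of the two mechanisms concrete, the sketch below shows one recurrent step of a gated delta rule on a matrix-valued fast-weight state. It is a minimal illustration only: the (d_k, d_v) state layout, the per-token scalars `alpha` and `beta`, and the function name are assumptions for exposition, not the paper's released implementation or its hardware-efficient parallel training algorithm.

```python
import torch

def gated_delta_rule_step(S, k, v, alpha, beta):
    """One recurrent step of a gated delta rule (illustrative sketch).

    S:     (d_k, d_v) memory / fast-weight state from the previous step
    k:     (d_k,)     key vector (assumed L2-normalized)
    v:     (d_v,)     value vector
    alpha: scalar in (0, 1], gating term that decays (erases) old memory
    beta:  scalar in (0, 1], writing strength of the delta update
    """
    # Gating: uniformly decay the previous memory, enabling rapid erasure.
    S = alpha * S
    # Delta rule: read out the value the decayed memory currently associates
    # with k, then overwrite it by interpolating toward the new value v.
    v_old = S.t() @ k                          # (d_v,) current association for k
    S = S + torch.outer(k, beta * (v - v_old)) # targeted, key-specific update
    return S
```

Applying the gate before the delta-rule write means global erasure and the key-specific overwrite compose in a single step, which is the complementarity the abstract refers to; queries would read from the state as `q @ S` in this layout.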