Transformers with linear attention (i.e., linear transformers) and state-space models have recently been proposed as viable linear-time alternatives to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers which replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B-parameter model for 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models which combine DeltaNet layers with (1) sliding-window attention layers every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.
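To make the contrast concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the two state updates the abstract refers to: the additive update of plain linear attention versus the delta-rule update used by DeltaNet, written here in its equivalent rank-1 (generalized Householder) form. The function names, dimensions, and the choice of a unit-norm key are illustrative assumptions.

```python
import numpy as np

def linear_attention_step(S, k, v):
    # Additive update of plain linear attention:
    #   S_t = S_{t-1} + v_t k_t^T
    return S + np.outer(v, k)

def delta_rule_step(S, k, v, beta):
    # Delta-rule update (DeltaNet):
    #   S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T
    # i.e., a (generalized) Householder transform of the state.
    # Equivalently: retrieve the old value under key k_t and move it
    # toward the new value v_t by step size beta_t.
    v_old = S @ k                       # value currently stored under k
    return S + beta * np.outer(v - v_old, k)

d = 4
rng = np.random.default_rng(0)
S = np.zeros((d, d))
k = rng.standard_normal(d)
k /= np.linalg.norm(k)                  # unit-norm key (assumption)
v = rng.standard_normal(d)

S = delta_rule_step(S, k, v, beta=1.0)
# With beta = 1 and a unit-norm key, querying the state with k
# recalls v exactly, which is why the delta rule helps associative recall:
print(np.allclose(S @ k, v))            # True
```

The additive update can only accumulate key-value associations, so stale values are never overwritten; the delta rule's subtraction of `v_old` lets the state replace what is stored under a key, at the cost of a sequential dependence between steps, which is what the paper's Householder-product representation parallelizes over.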