Transformers with linear attention (i.e., linear transformers) and state-space models have recently been suggested as viable linear-time alternatives to transformers with softmax attention. However, these models still underperform transformers, especially on tasks that require in-context retrieval. While more expressive variants of linear transformers that replace the additive update in linear transformers with the delta rule (DeltaNet) have been found to be more effective at associative recall, existing algorithms for training such models do not parallelize over sequence length and are thus inefficient to train on modern hardware. This work describes a hardware-efficient algorithm for training linear transformers with the delta rule, which exploits a memory-efficient representation for computing products of Householder matrices. This algorithm allows us to scale up DeltaNet to standard language modeling settings. We train a 1.3B-parameter model on 100B tokens and find that it outperforms recent linear-time baselines such as Mamba and GLA in terms of perplexity and zero-shot performance on downstream tasks. We also experiment with two hybrid models that combine DeltaNet layers with (1) sliding-window attention layers in every other layer or (2) two global attention layers, and find that these hybrids outperform strong transformer baselines.
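To make the contrast concrete, the sketch below shows one step of the additive linear-attention state update next to one step of the delta-rule update. This is a minimal NumPy illustration of the recurrences only, not the paper's parallel training algorithm; the variable names, dimensions, and the unit-norm key are illustrative assumptions.

```python
import numpy as np

d_k, d_v = 4, 4
rng = np.random.default_rng(0)

# Additive linear-attention update: S_t = S_{t-1} + v_t k_t^T
def additive_update(S, k, v):
    return S + np.outer(v, k)

# Delta-rule update (DeltaNet):
#   S_t = S_{t-1} - beta_t (S_{t-1} k_t - v_t) k_t^T
# The old readout S_{t-1} k_t is partially overwritten by v_t.
# Note S_t = S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T, where
# (I - beta_t k_t k_t^T) is a (generalized) Householder matrix -- the
# structure the paper's memory-efficient representation exploits.
def delta_update(S, k, v, beta):
    pred = S @ k  # current memory readout for key k
    return S - beta * np.outer(pred - v, k)

S = np.zeros((d_v, d_k))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)  # unit-norm key (illustrative assumption)
v = rng.standard_normal(d_v)

S = delta_update(S, k, v, beta=1.0)
# With beta = 1 and a unit-norm key, the memory now returns v exactly for k:
print(np.allclose(S @ k, v))  # True
```

Repeating the additive update for the same key keeps accumulating `v` into the readout, whereas the delta rule with `beta = 1` replaces the stored value, which is why it is better suited to associative recall.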