Linear Transformers and State Space Models have emerged as efficient alternatives to softmax Transformers for causal sequence modeling, enabling parallel training via matrix multiplication and efficient RNN-style inference. Despite their success in causal tasks, however, no unified framework exists for applying Linear Transformers to bidirectional sequence modeling. We introduce LION, the first framework to systematically extend Linear Transformers to the bidirectional setting. LION generalizes three core representations commonly used in the causal case (full Linear Attention, bidirectional RNN, and chunkwise parallel form) to the bidirectional setting. These forms are theoretically equivalent, allowing models to exploit the strengths of each during training and inference. We prove that a broad class of Linear Transformers can be extended via LION and validate the framework with three core examples distinguished by their choice of decay: LION-LIT, the bidirectional extension of arXiv:2006.16236; LION-D, based on arXiv:2307.08621; and LION-S, a variant using selective decay (arXiv:2103.02143, arXiv:2312.00752). Across standard bidirectional tasks, LION matches or exceeds the performance of softmax Transformers while offering significantly faster training and more efficient inference than existing State Space Models.
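To make the equivalence claim concrete, the following is a minimal NumPy sketch (our illustration, not the authors' code) of the simplest decay-free case, roughly corresponding to LION-LIT with decay fixed to 1 and attention normalization omitted. All variable names are illustrative assumptions. It computes the same bidirectional output in two of the three forms, once as full (parallel) Linear Attention and once as a bidirectional RNN, and checks that they agree.

```python
# Minimal sketch of the claimed equivalence between full Linear Attention
# and the bidirectional RNN form, in the decay-free case (decay = 1,
# normalization omitted). Illustrative only; not the authors' implementation.
import numpy as np

rng = np.random.default_rng(0)
L, d = 6, 4                      # sequence length, head dimension
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Form 1: full (parallel) linear attention -- no causal mask, so every
# position attends to every position, via two matrix multiplications.
O_full = (Q @ K.T) @ V           # (L, L) scores times values

# Form 2: bidirectional RNN -- a forward and a backward recurrence over
# the rank-one states k_i v_i^T, combined at the end. The diagonal term
# (q_i . k_i) v_i is accumulated by both passes, so it is subtracted once.
S_fwd = np.zeros((d, d))
S_bwd = np.zeros((d, d))
O_fwd = np.zeros((L, d))
O_bwd = np.zeros((L, d))
for i in range(L):               # forward pass: state sums k_j v_j^T, j <= i
    S_fwd += np.outer(K[i], V[i])
    O_fwd[i] = Q[i] @ S_fwd
for i in reversed(range(L)):     # backward pass: state sums k_j v_j^T, j >= i
    S_bwd += np.outer(K[i], V[i])
    O_bwd[i] = Q[i] @ S_bwd
diag = np.sum(Q * K, axis=1, keepdims=True) * V   # double-counted diagonal
O_rnn = O_fwd + O_bwd - diag

assert np.allclose(O_full, O_rnn)  # the two forms produce the same output
```

The sketch also shows the trade-off the abstract alludes to: the parallel form materializes an L-by-L score matrix and trains via matrix multiplication, while each RNN pass carries only a d-by-d state, which is what makes inference memory-efficient; the chunkwise parallel form (not shown) interpolates between the two.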