Linear State Space Models (SSMs) have demonstrated strong performance in a variety of sequence modeling tasks due to their efficient encoding of the recurrent structure. However, in more comprehensive tasks such as language modeling and machine translation, self-attention-based models still outperform SSMs. Hybrid models employing both SSMs and self-attention generally show promising performance, but current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. In this work, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. By allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption at both the training and inference stages of sequence modeling. As a specific instantiation of SMA, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to conduct only local attention on the activated inputs, SeqBoat achieves linear inference complexity with a theoretically infinite attention span, and provides a substantially better quality-efficiency trade-off than chunking-based models. In experiments on a wide range of tasks, including language modeling, speech classification, and Long Range Arena, SeqBoat achieves new state-of-the-art results among hybrid models with linear complexity and reveals the amount of attention needed for each task through the learned sparse activation patterns.
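To make the idea of skipping non-activated sub-modules concrete, the following is a minimal NumPy sketch, not the SeqBoat implementation. It rests on several simplifying assumptions: the activation decision is a hard threshold on a linear projection (which is not differentiable, unlike SMA as described above), the gate reads the raw inputs rather than SSM state representations, and the sub-module is plain softmax attention rather than a GAU. The names `sparse_activate`, `local_causal_attention`, `gate_w`, and `window` are hypothetical and chosen only for illustration.

```python
import numpy as np


def local_causal_attention(h, window):
    """Each row attends to itself and up to `window - 1` preceding rows.

    Cost is O(n * window) for n rows, i.e. linear in n for a fixed window.
    """
    n, d = h.shape
    out = np.empty_like(h)
    for i in range(n):
        ctx = h[max(0, i - window + 1): i + 1]   # local context, includes row i
        scores = ctx @ h[i] / np.sqrt(d)
        weights = np.exp(scores - scores.max())  # numerically stable softmax
        out[i] = (weights / weights.sum()) @ ctx
    return out


def sparse_activate(x, gate_w, window=2):
    """Apply the attention sub-module only to activated sequence elements.

    x:      (T, d) float array of per-element representations.
    gate_w: (d,) hypothetical gate weights; an element is activated when
            its gate logit is positive (a hard, non-differentiable choice
            made here purely for illustration).
    """
    active = (x @ gate_w) > 0                    # (T,) boolean activation mask
    out = x.copy()                               # skipped elements pass through
    if active.any():
        compressed = x[active]                   # keep only activated elements
        # Local attention runs over the compressed sequence, so neighbours
        # there may be far apart in the original sequence.
        out[active] = compressed + local_causal_attention(compressed, window)
    return out, active
```

Because the local window is applied after the activated elements are gathered into a shorter sequence, two elements that are adjacent in the compressed sequence can be arbitrarily far apart in the original one, which is the intuition behind combining local attention with an unbounded attention span.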