Sparse Modular Activation for Efficient Sequence Modeling

Linear State Space Models (SSMs) have demonstrated strong performance in a variety of sequence modeling tasks due to their efficient encoding of the recurrent structure. However, in more comprehensive tasks like language modeling and machine translation, self-attention-based models still outperform SSMs. Hybrid models employing both SSM and self-attention generally show promising performance, but current approaches apply attention modules statically and uniformly to all elements in the input sequences, leading to sub-optimal quality-efficiency trade-offs. In this work, we introduce Sparse Modular Activation (SMA), a general mechanism enabling neural networks to sparsely and dynamically activate sub-modules for sequence elements in a differentiable manner. By allowing each element to skip non-activated sub-modules, SMA reduces computation and memory consumption at both the training and inference stages of sequence modeling. As a specific instantiation of SMA, we design a novel neural architecture, SeqBoat, which employs SMA to sparsely activate a Gated Attention Unit (GAU) based on the state representations learned from an SSM. By constraining the GAU to conduct only local attention on the activated inputs, SeqBoat can achieve linear inference complexity with a theoretically infinite attention span, and provides a substantially better quality-efficiency trade-off than chunking-based models. In experiments on a wide range of tasks, including language modeling, speech classification, and Long Range Arena, SeqBoat achieves new state-of-the-art results among hybrid models with linear complexity and reveals the amount of attention needed for each task through the learned sparse activation patterns.
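The skip-and-scatter idea behind sparse activation can be illustrated with a toy sketch. This is not the SeqBoat implementation: the function name `sparse_activate`, the hard threshold on `gate_scores`, the residual update, and the mean-pooling stand-in for an attention module are all assumptions made for illustration. The sketch covers only the forward pass and leaves out how gradients would reach the gate, which the abstract describes as differentiable.

```python
from typing import Callable, List, Sequence


def sparse_activate(
    states: Sequence[float],
    gate_scores: Sequence[float],
    module: Callable[[List[float]], List[float]],
    threshold: float = 0.0,
) -> List[float]:
    """Run `module` only on elements whose gate score exceeds `threshold`.

    Hypothetical helper: skipped elements pass through unchanged, and
    activated elements receive a residual update (an assumption).
    """
    # Indices of the sequence elements that activate the sub-module.
    active_idx = [i for i, score in enumerate(gate_scores) if score > threshold]

    # Compress: the module only ever sees the activated elements.
    compressed = [states[i] for i in active_idx]
    processed = module(compressed) if compressed else []

    # Scatter: write the module outputs back to their original positions.
    out = list(states)
    for i, y in zip(active_idx, processed):
        out[i] = states[i] + y
    return out


def mean_mixer(xs: List[float]) -> List[float]:
    """Toy stand-in for an attention module: every element sees the mean."""
    mean = sum(xs) / len(xs)
    return [mean] * len(xs)


states = [1.0, 2.0, 3.0, 4.0]
gate_scores = [-1.0, 1.0, -1.0, 1.0]  # e.g. derived from SSM state representations
print(sparse_activate(states, gate_scores, mean_mixer))  # [1.0, 5.0, 3.0, 7.0]
```

In this toy run only two of the four elements reach `mean_mixer`, so the cost of the module scales with the number of activated elements rather than with the full sequence length.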

