Is it always necessary to compute tokens from shallow to deep layers in Transformers? The continued success of vanilla Transformers and their variants suggests an undoubted "yes". In this work, however, we attempt to break the depth-ordered convention by proposing a novel architecture dubbed mixture-of-modules (MoM), which is motivated by an intuition that any layer, regardless of its position, can be used to compute a token as long as it possesses the needed processing capabilities. The construction of MoM starts from a finite set of modules defined by multi-head attention and feed-forward networks, each distinguished by its unique parameterization. Two routers then iteratively select attention modules and feed-forward modules from the set to process a token. The selection dynamically expands the computation graph in the forward pass of the token, culminating in an assembly of modules. We show that MoM provides not only a unified framework for Transformers and their numerous variants but also a flexible and learnable approach to reducing redundancy in Transformer parameterization. We pre-train various MoMs on OpenWebText. Empirical results demonstrate that MoMs of different parameter counts consistently outperform vanilla Transformers on both the GLUE and XSUM benchmarks. More interestingly, with a fixed parameter budget, MoM-large enables a more than 38% increase in the depth of computation graphs compared to GPT-2-large, resulting in absolute gains of 1.4 on GLUE and 1 on XSUM. On the other hand, MoM-large also enables a more than 60% reduction in depth while involving more modules per layer, yielding a 16% reduction in TFLOPs and a 43% decrease in memory usage compared to GPT-2-large, while maintaining comparable performance.
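The iterative routing described above can be illustrated with a deliberately minimal, toy sketch. Everything here is an assumption for illustration only, not the paper's implementation: the module pools are scalar toy functions standing in for attention and feed-forward modules, and the router is a simple argmax over per-module scores. The point is only to show how two routers can repeatedly pick modules from a shared pool, so that each token's forward pass becomes a dynamically assembled computation graph.

```python
# Toy sketch of MoM-style routing (illustrative only; all names and the
# scoring rule are hypothetical, not the paper's actual architecture).
import random

random.seed(0)

# Hypothetical module pools: each "module" is a scalar function here,
# standing in for a uniquely parameterized attention or FFN module.
ATTN_MODULES = [lambda x, s=s: x + 0.1 * s for s in range(4)]
FFN_MODULES = [lambda x, s=s: x * (1.0 + 0.05 * s) for s in range(4)]

def route(hidden, weights):
    """Pick the index of the module with the highest (toy) score."""
    scores = [w * hidden for w in weights]
    return max(range(len(scores)), key=lambda i: scores[i])

def mom_forward(x, depth, attn_weights, ffn_weights):
    """Assemble a per-token computation graph of `depth` (attn, ffn) steps."""
    graph = []
    for _ in range(depth):
        a = route(x, attn_weights)   # attention router selects a module
        x = ATTN_MODULES[a](x)
        f = route(x, ffn_weights)    # FFN router selects a module
        x = FFN_MODULES[f](x)
        graph.append((a, f))         # record the assembled module pair
    return x, graph

# Toy (random) router parameters; in the real model these are learned.
attn_weights = [random.random() for _ in ATTN_MODULES]
ffn_weights = [random.random() for _ in FFN_MODULES]
out, graph = mom_forward(1.0, depth=3,
                         attn_weights=attn_weights, ffn_weights=ffn_weights)
print(graph)  # the sequence of (attention, FFN) module choices
```

Note how, unlike a fixed depth-ordered stack, the same module can be selected at any step, and the effective depth (`depth` here) is decoupled from the number of distinct parameterized modules.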