Previous work on Universal Transformers (UTs) has demonstrated the importance of parameter sharing across layers. By allowing recurrence in depth, UTs have advantages over standard Transformers in learning compositional generalizations, but layer sharing comes with a practical limitation in the parameter-compute ratio: it drastically reduces the parameter count compared to a non-shared model of the same dimensionality. Naively scaling up the layer size to compensate for the lost parameters makes the computational resource requirements prohibitive. In practice, no previous work has succeeded in proposing a shared-layer Transformer design that is competitive on parameter-count-dominated tasks such as language modeling. Here we propose MoEUT (pronounced "moot"), an effective mixture-of-experts (MoE)-based shared-layer Transformer architecture, which combines several recent advances in MoEs for both the feedforward and attention layers of standard Transformers with novel layer-normalization and grouping schemes that are specific and crucial to UTs. The resulting UT model, for the first time, slightly outperforms standard Transformers on language modeling tasks such as BLiMP and PIQA, while using significantly less compute and memory.
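The two ideas the abstract combines can be illustrated with a toy sketch: one block of weights applied recurrently in depth (the UT part), with a top-k mixture-of-experts feedforward inside it, so the parameter count can grow with the number of experts without growing per-token compute. This is a minimal illustration only, not the paper's MoEUT architecture; all sizes, the ReLU experts, and the routing rule are assumptions for the sketch, and attention, layer normalization, and the paper's grouping schemes are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, d_expert, top_k, n_steps = 16, 8, 32, 2, 4

# Shared parameters: reused at every depth step (the layer-sharing part).
router = rng.normal(scale=0.1, size=(d_model, n_experts))
w_in = rng.normal(scale=0.1, size=(n_experts, d_model, d_expert))
w_out = rng.normal(scale=0.1, size=(n_experts, d_expert, d_model))

def moe_ffn(x):
    """Route each token to its top-k experts and mix their outputs."""
    scores = x @ router                            # (tokens, experts)
    idx = np.argsort(scores, axis=-1)[:, -top_k:]  # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        # Softmax over the selected experts' scores only.
        s = scores[t, idx[t]]
        gate = np.exp(s - s.max())
        gate /= gate.sum()
        for g, e in zip(gate, idx[t]):
            h = np.maximum(x[t] @ w_in[e], 0.0)    # illustrative ReLU expert
            out[t] += g * (h @ w_out[e])
    return out

x = rng.normal(size=(5, d_model))  # 5 toy "tokens"
for _ in range(n_steps):           # depth recurrence: same weights at every step
    x = x + moe_ffn(x)             # residual update

print(x.shape)
```

Note that per token only `top_k` of the `n_experts` experts run, which is how an MoE layer recovers the parameters lost to sharing without a matching increase in compute.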