A simple design recipe for deep transformers is to compose identical building blocks. But standard transformer blocks are far from simple, interweaving attention and MLP sub-blocks with skip connections and normalisation layers in precise arrangements. This complexity leads to brittle architectures, where seemingly minor changes can significantly reduce training speed or render models untrainable. In this work, we ask to what extent the standard transformer block can be simplified. Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers match the per-update training speed and performance of standard transformers, while achieving 15% higher training throughput and using 15% fewer parameters.
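To get a feel for where a parameter saving of this size could come from, the following is a rough back-of-the-envelope count, not the authors' own accounting. It assumes a hypothetical model width of 768, an MLP hidden width of four times the model width, and counts only the weight matrices of one block (biases, normalisation gains and embeddings are ignored). The helper `block_params` and its arguments are illustrative names, not taken from the paper.

```python
def block_params(d_model, mlp_ratio=4, use_value=True, use_proj=True):
    """Count weight-matrix parameters in one transformer block.

    Assumptions (not from the abstract): square d_model x d_model attention
    matrices, a two-layer MLP of hidden width mlp_ratio * d_model, and no
    biases, normalisation gains or embeddings.
    """
    attn = 2 * d_model * d_model          # query and key matrices
    if use_value:
        attn += d_model * d_model         # value matrix
    if use_proj:
        attn += d_model * d_model         # output projection matrix
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    return attn + mlp


d = 768  # hypothetical width, chosen only for illustration
standard = block_params(d)
simplified = block_params(d, use_value=False, use_proj=False)
saving = 1 - simplified / standard

print(f"standard: {standard:,}  simplified: {simplified:,}  saving: {saving:.1%}")
# prints: standard: 7,077,888  simplified: 5,898,240  saving: 16.7%
```

Under these assumptions, dropping the value and projection matrices removes one sixth of the per-block weights. Parameters that the simplification leaves untouched, such as embeddings, would dilute this figure, so it is at least of the same order as the roughly 15% overall reduction reported above.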