ResiDual: Transformer with Dual Residual Connections

Transformer networks have become the preferred architecture for many tasks due to their state-of-the-art performance. However, the optimal way to implement residual connections in Transformers, which are essential for effective training, is still debated. Two widely used variants are the Post-Layer-Normalization (Post-LN) and Pre-Layer-Normalization (Pre-LN) Transformers, which apply layer normalization after each residual block's output or before each residual block's input, respectively. While both variants have their advantages, they also suffer from severe limitations: Post-LN causes a gradient vanishing issue that hinders the training of deep Transformers, and Pre-LN causes a representation collapse issue that limits model capacity. In this paper, we propose ResiDual, a novel Transformer architecture with Pre-Post-LN (PPLN), which fuses the connections of Post-LN and Pre-LN, inheriting their advantages while avoiding their limitations. We conduct both theoretical analyses and empirical experiments to verify the effectiveness of ResiDual. Theoretically, we prove that the gradient of ResiDual has a lower bound, which avoids the vanishing issue, thanks to the residual connection from Pre-LN. Moreover, ResiDual maintains diverse model representations, which avoids the collapse issue, thanks to the residual connection from Post-LN. Empirically, ResiDual outperforms both Post-LN and Pre-LN on several machine translation benchmarks across different network depths and data sizes. Given its strong theoretical and empirical performance, the ResiDual Transformer can serve as a foundation architecture for different AI models (e.g., large language models). Our code is available at https://github.com/microsoft/ResiDual.
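
To make the three residual layouts concrete, the following is a minimal NumPy sketch, not the authors' implementation. The Post-LN and Pre-LN loops follow the definitions given above. The ResiDual loop is one plausible reading of "fusing" the two connections and is an assumption, since the abstract does not spell out the update rule: a Post-LN stream is normalized at every block, an unnormalized dual stream accumulates the block outputs, and the two are combined at the end. The helper `make_sublayers` and its tanh blocks are hypothetical stand-ins for attention and feed-forward sublayers, and the layer normalization omits learned scale and shift.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize the last axis to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def make_sublayers(num_layers, dim, seed=0):
    """Hypothetical stand-ins for attention/FFN blocks: random linear maps with tanh."""
    rng = np.random.default_rng(seed)
    weights = [rng.normal(scale=dim ** -0.5, size=(dim, dim)) for _ in range(num_layers)]
    # Bind each weight matrix through a default argument so every block keeps its own.
    return [lambda x, w=w: np.tanh(x @ w) for w in weights]


def post_ln(x, sublayers):
    """Post-LN: normalize after adding each block's output to the residual."""
    for f in sublayers:
        x = layer_norm(x + f(x))
    return x


def pre_ln(x, sublayers):
    """Pre-LN: normalize each block's input; the residual stream stays unnormalized."""
    for f in sublayers:
        x = x + f(layer_norm(x))
    return layer_norm(x)  # final normalization, as is common for Pre-LN stacks


def residual_dual(x, sublayers):
    """Assumed ResiDual-style update: a Post-LN stream plus an unnormalized dual stream."""
    x_post = x  # stream normalized after every block (Post-LN style)
    x_dual = x  # stream that only accumulates block outputs (Pre-LN style shortcut)
    for f in sublayers:
        out = f(x_post)
        x_post = layer_norm(x_post + out)
        x_dual = x_dual + out
    return x_post + layer_norm(x_dual)


if __name__ == "__main__":
    x = np.random.default_rng(1).normal(size=(4, 16))
    blocks = make_sublayers(num_layers=6, dim=16)
    for fn in (post_ln, pre_ln, residual_dual):
        print(fn.__name__, fn(x, blocks).shape)
```

In this sketch the dual stream gives every block an additive, normalization-free path to the output, which is the property the abstract credits for the gradient lower bound, while each block still reads from the per-block normalized stream, which the abstract credits for avoiding representation collapse.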
