Transformers have proved effective in many NLP tasks. However, training them requires non-trivial effort in carefully designing cutting-edge optimizers and learning rate schedulers (e.g., conventional SGD fails to train Transformers effectively). Our objective here is to understand $\textit{what complicates Transformer training}$ from both empirical and theoretical perspectives. Our analysis reveals that unbalanced gradients are not the root cause of training instability. Instead, we identify an amplification effect that influences training substantially: for each layer in a multi-layer Transformer model, heavy dependency on its residual branch makes training unstable, since it amplifies small parameter perturbations (e.g., parameter updates) and results in significant disturbances in the model output. Yet we observe that a light dependency limits the model's potential and leads to inferior trained models. Inspired by our analysis, we propose Admin ($\textbf{Ad}$aptive $\textbf{m}$odel $\textbf{in}$itialization) to stabilize training in the early stage and unleash the model's full potential in the late stage. Extensive experiments show that Admin is more stable, converges faster, and leads to better performance. Implementations are released at: https://github.com/LiyuanLucasLiu/Transforemr-Clinic.
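The amplification effect described above can be illustrated with a small numerical sketch. This is a toy experiment under stated assumptions, not the released Admin implementation: the Post-LN stack, the single-matrix residual branch `f(x) = W @ relu(x)`, the scalar `skip_scale` on the skip connection, and the function names are all hypothetical choices made for illustration. A small `skip_scale` means each layer's output depends heavily on its residual branch; a large one means light dependency. The sketch perturbs every weight matrix by the same small relative noise and measures how much the final output moves.

```python
import numpy as np


def layer_norm(x):
    # Parameter-free layer normalization over the feature dimension.
    return (x - x.mean()) / (x.std() + 1e-6)


def forward(x, weights, skip_scale):
    # Toy Post-LN residual stack (an assumption for illustration):
    #   x <- LN(skip_scale * x + W @ relu(x))
    for w in weights:
        x = layer_norm(skip_scale * x + w @ np.maximum(x, 0.0))
    return x


def output_disturbance(skip_scale, num_layers=12, dim=256, noise=1e-2, seed=0):
    # Mean squared change of the output when every weight matrix
    # receives a small random perturbation (a stand-in for a parameter update).
    rng = np.random.default_rng(seed)
    scale = np.sqrt(2.0 / dim)  # He-style scaling for the ReLU branch
    x = rng.standard_normal(dim)
    weights = [rng.standard_normal((dim, dim)) * scale for _ in range(num_layers)]
    perturbed = [w + noise * scale * rng.standard_normal((dim, dim)) for w in weights]
    diff = forward(x, perturbed, skip_scale) - forward(x, weights, skip_scale)
    return float(np.mean(diff ** 2))


heavy = output_disturbance(skip_scale=1.0)  # heavy dependency on the residual branch
light = output_disturbance(skip_scale=4.0)  # light dependency on the residual branch
print(f"heavy dependency: {heavy:.2e}")
print(f"light dependency: {light:.2e}")
```

In this toy setting the same parameter perturbation disturbs the output more when the skip connection is not up-weighted, which matches the direction of the argument in the abstract. The sketch says nothing about the other half of the trade-off, namely that light dependency limits the quality of the trained model.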