Pre-trained Transformers inherently exhibit sparse activation, where only a small fraction of the neurons are activated for each token. While sparse activation has been explored through post-training methods, its potential in pre-training remains untapped. In this work, we first study how activation properties change during pre-training. Our examination reveals that Transformers exhibit sparse activation throughout most of the pre-training process, while the activation correlation keeps evolving as training progresses. Leveraging this observation, we propose Switchable Sparse-Dense Learning (SSD). SSD adaptively switches between Mixture-of-Experts (MoE) based sparse training and conventional dense training during the pre-training process, harnessing the efficiency of sparse training while avoiding its static activation correlation. Compared to dense training, SSD achieves comparable performance with an identical model size and reduces pre-training costs. Moreover, models trained with SSD can be directly used as MoE models for sparse inference, matching the performance of dense models with up to $2\times$ faster inference. Code is available at https://github.com/thunlp/moefication.