Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite its crucial role, the impact of activation sparsity on this conversion remains unexplored. We demonstrate that the efficiency of the conversion can be significantly enhanced by proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance in the number of activated neurons across inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. To achieve further savings, we extend this approach to multi-head attention projections. Finally, we develop an efficient implementation that translates these computational savings into actual wall-clock speedup. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, reducing inference cost by up to 60% without significantly impacting performance.
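To make the per-token idea concrete, the dynamic-$k$ rule can be sketched as follows. This is an illustrative sketch, not the paper's exact formulation: it assumes a threshold on normalized router scores, so that each token executes only the experts whose score clears the threshold, and the number of active experts therefore varies per token. The function name `dynamic_k_select` and the threshold value are hypothetical.

```python
import numpy as np

def dynamic_k_select(router_scores, threshold=0.2):
    """Per-token dynamic-k expert selection (illustrative sketch).

    Instead of a fixed top-k, each token executes every expert whose
    normalized (softmax) router score exceeds `threshold`, so the
    number of active experts adapts to the input.
    """
    # Softmax over the expert dimension for each token.
    shifted = router_scores - router_scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    # Boolean mask: which experts run for which token.
    return probs >= threshold

# Two tokens, four experts: the first token's router mass concentrates
# on a single expert, the second token's mass spreads across several.
scores = np.array([[4.0, 0.0, 0.0, 0.0],
                   [1.0, 1.0, 1.0, 0.2]])
mask = dynamic_k_select(scores, threshold=0.2)
print(mask.sum(axis=1))  # experts executed per token: [1 3]
```

The confident first token runs a single expert while the ambiguous second token runs three, which is exactly the compute-allocation behavior a fixed top-$k$ router cannot express.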