Transformer models can face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by converting parts of the network into equivalent Mixture-of-Experts (MoE) layers. Despite the crucial role played by activation sparsity, its impact on this conversion process remains unexplored. We show that the efficiency of the conversion can be significantly enhanced by a proper regularization of the activation sparsity of the base model. Moreover, motivated by the high variance in the number of activated neurons across different inputs, we introduce a more effective dynamic-$k$ expert selection rule that adjusts the number of executed experts on a per-token basis. Finally, we extend this approach to the multi-head attention projections, which yields additional savings compared to converting only the FFN blocks. The proposed method, Dense to Dynamic-$k$ Mixture-of-Experts (D2DMoE), outperforms existing approaches on common NLP and vision tasks, allowing us to save up to 60% of inference cost without significantly affecting model performance.
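The per-token expert selection described above can be illustrated with a minimal sketch. This is not the paper's exact formulation; the function name `dynamic_k_select` and the threshold parameter `tau` are hypothetical, and the sketch only conveys the core idea: instead of a fixed top-$k$, each token executes every expert whose router score clears a threshold, so "easy" tokens run fewer experts than "hard" ones.

```python
import numpy as np

def dynamic_k_select(router_scores, tau):
    """For each token, pick every expert whose router score exceeds tau.

    Unlike fixed top-k routing, the number of selected experts
    varies per token (hence "dynamic-k").
    """
    selections = []
    for scores in router_scores:
        chosen = np.nonzero(scores > tau)[0]
        if chosen.size == 0:
            # Always execute at least the highest-scoring expert.
            chosen = np.array([int(np.argmax(scores))])
        selections.append(chosen)
    return selections

# Two tokens over 4 experts: an "easy" token whose activation mass
# concentrates on one expert, and a "hard" token that spreads it out.
scores = np.array([
    [0.90, 0.05, 0.03, 0.02],  # easy token
    [0.30, 0.25, 0.25, 0.20],  # hard token
])
picked = dynamic_k_select(scores, tau=0.22)
print([len(p) for p in picked])  # → [1, 3]
```

The easy token runs a single expert while the hard token runs three, which is the source of the per-token compute savings the abstract refers to.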