Transformer models, despite their impressive performance, often face practical limitations due to their high computational requirements. At the same time, such models exhibit significant activation sparsity, which can be leveraged to reduce the inference cost by transforming parts of the network into Mixture-of-Experts (MoE) layers. However, despite the crucial role of activation sparsity, its impact on this process remains unexplored. In this paper, we enhance the efficiency of MoE conversion through activation sparsity enforcement. Moreover, motivated by the high variance in the number of activated neurons, we propose a more effective dynamic-k expert selection rule that adjusts the number of executed experts on a per-token basis. Finally, we extend this approach to multi-head attention projections, which results in even further savings. The proposed method, Sparsified Activation Dynamic-k Mixture-of-Experts (SADMoE), outperforms existing approaches on common NLP and vision tasks, allowing us to save up to 60% of inference cost without significantly affecting model performance.
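To illustrate what a per-token dynamic-k selection rule might look like, the following is a minimal sketch and not the method described in the paper. The abstract does not specify the rule, so the sketch assumes a relative-threshold criterion: for each token, run every expert whose router score is at least a fraction tau of that token's highest score. The function name `dynamic_k_select`, the parameter `tau`, and the assumption that router scores are non-negative are all hypothetical.

```python
from typing import List


def dynamic_k_select(scores: List[float], tau: float = 0.5) -> List[int]:
    """Return the indices of the experts to execute for a single token.

    Assumed rule (hypothetical, not taken from the paper): keep every
    expert whose router score is at least `tau` times the token's
    maximum score. Scores are assumed to be non-negative.
    """
    if not 0.0 <= tau <= 1.0:
        raise ValueError("tau must lie in [0, 1]")
    if any(s < 0.0 for s in scores):
        raise ValueError("scores are assumed to be non-negative")
    if not scores:
        return []
    threshold = tau * max(scores)
    # The top-scoring expert always passes, so at least one expert runs.
    return [i for i, s in enumerate(scores) if s >= threshold]


# A token with one dominant expert executes a single expert ...
peaked_token = [0.9, 0.1, 0.05, 0.02]
# ... while a token with several comparable scores executes more.
flat_token = [0.5, 0.45, 0.4, 0.05]

print(dynamic_k_select(peaked_token))  # [0]
print(dynamic_k_select(flat_token))    # [0, 1, 2]
```

Under this assumed rule, the number of executed experts follows the shape of each token's score distribution rather than a fixed top-k, which is the kind of behavior the high variance in activated neurons would motivate. Lowering tau trades compute for fidelity by admitting more experts per token.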