Transformer models, despite their impressive performance, often face practical limitations due to their high computational requirements. At the same time, previous studies have revealed significant activation sparsity in these models, indicating the presence of redundant computations. In this paper, we propose Dynamic Sparsified Transformer Inference (DSTI), a method that radically reduces the inference cost of Transformer models by enforcing activation sparsity and subsequently transforming a dense model into its sparse Mixture of Experts (MoE) version. We demonstrate that it is possible to train small gating networks that successfully predict the relative contribution of each expert during inference. Furthermore, we introduce a mechanism that dynamically determines the number of executed experts individually for each token. DSTI can be applied to any Transformer-based architecture and has a negligible impact on accuracy. For the BERT-base classification model, we reduce inference cost by almost 60%.
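To make the per-token expert selection concrete, the following is a minimal sketch and not the paper's implementation. It assumes a gating network has already produced one non-negative contribution score per expert, and it uses a hypothetical relative threshold `tau`: an expert runs only if its score is at least `tau` times the largest score for that token. The names `select_experts`, `sparse_forward`, and `tau` are illustrative assumptions; the abstract does not specify the selection rule.

```python
def select_experts(gate_scores, tau=0.5):
    """Return the indices of the experts to execute for one token.

    gate_scores: non-negative predicted contributions, one per expert
                 (assumed output of a small gating network).
    tau:         hypothetical threshold in [0, 1]; an expert is kept when
                 its score is at least tau times the token's largest score.
    """
    if not gate_scores:
        return []
    cutoff = tau * max(gate_scores)
    return [i for i, score in enumerate(gate_scores) if score >= cutoff]


def sparse_forward(token, experts, gate_scores, tau=0.5):
    """Run only the selected experts and sum their outputs element-wise."""
    chosen = select_experts(gate_scores, tau)
    outputs = [experts[i](token) for i in chosen]
    summed = [sum(values) for values in zip(*outputs)]
    return summed, chosen


# Toy experts: expert k scales the token by k (purely illustrative).
experts = [lambda x, k=k: [k * v for v in x] for k in range(1, 5)]

# A token with two dominant experts executes only those two.
peaked_out, peaked_chosen = sparse_forward(
    [1.0, 2.0], experts, [0.9, 0.1, 0.5, 0.05]
)

# A token with uniform scores executes every expert.
flat_out, flat_chosen = sparse_forward(
    [1.0, 2.0], experts, [0.3, 0.3, 0.3, 0.3]
)
```

Under this assumed rule the number of executed experts varies per token: a peaked score distribution runs few experts, while a flat one runs many, so the compute spent follows the gating network's predictions.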