Deploying pre-trained transformer models like BERT on downstream tasks in resource-constrained scenarios is challenging due to their high inference cost, which grows rapidly with input sequence length. In this work, we propose ToP, a constraint-aware and ranking-distilled token pruning method that selectively removes unnecessary tokens as the input sequence passes through the layers, speeding up online inference while preserving accuracy. Token importance rankings derived from conventional self-attention are inaccurate in early layers; ToP overcomes this limitation through ranking distillation, which distills effective token rankings from the final layer of unpruned models into the early layers of pruned models. ToP then introduces a coarse-to-fine pruning approach that automatically selects the optimal subset of transformer layers and optimizes the token pruning decisions within these layers through improved $L_0$ regularization. Extensive experiments on the GLUE benchmark and SQuAD tasks demonstrate that ToP outperforms state-of-the-art token pruning and model compression methods in both accuracy and speedup. ToP reduces the average FLOPs of BERT by 8.1x while achieving competitive accuracy on GLUE, and provides a real latency speedup of up to 7.4x on an Intel CPU.
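To make the idea of attention-based token pruning concrete, the following is a minimal sketch of the conventional baseline that the abstract refers to: each token is scored by the attention it receives, and only the top-ranked tokens are passed to the next layer. It is not the ToP implementation. The function names `token_importance` and `prune_tokens`, the fixed `keep_ratio`, and the choice to always keep position 0 (assumed here to be the `[CLS]` token) are illustrative assumptions; ToP instead learns its pruning decisions with $L_0$ regularization and refines the rankings by distillation.

```python
import math
from typing import List, Sequence, Tuple


def token_importance(attention: List[List[List[float]]]) -> List[float]:
    """Score each token by the attention it receives.

    attention[h][i][j] is the weight that query token i places on key
    token j in head h. The score of token j is that weight averaged over
    all heads and all query positions.
    """
    num_heads = len(attention)
    seq_len = len(attention[0])
    scores = [0.0] * seq_len
    for head in attention:
        for row in head:
            for j, weight in enumerate(row):
                scores[j] += weight
    return [s / (num_heads * seq_len) for s in scores]


def prune_tokens(
    hidden: List[List[float]],
    scores: List[float],
    keep_ratio: float,
    protected: Sequence[int] = (0,),
) -> Tuple[List[int], List[List[float]]]:
    """Keep the highest-scoring tokens, preserving their original order.

    keep_ratio is a fixed hyperparameter in this sketch (an assumption).
    Positions listed in `protected` are always kept.
    """
    seq_len = len(hidden)
    num_keep = max(len(protected), math.ceil(seq_len * keep_ratio))
    keep = set(protected)
    # Rank the unprotected positions from most to least important.
    candidates = [i for i in range(seq_len) if i not in keep]
    candidates.sort(key=lambda i: scores[i], reverse=True)
    for i in candidates:
        if len(keep) >= num_keep:
            break
        keep.add(i)
    kept = sorted(keep)
    return kept, [hidden[i] for i in kept]


# Toy example: one attention head over four tokens.
attn = [[
    [0.1, 0.6, 0.1, 0.2],
    [0.2, 0.5, 0.1, 0.2],
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
]]
states = [[0.0], [1.0], [2.0], [3.0]]
kept_indices, pruned_states = prune_tokens(states, token_importance(attn), keep_ratio=0.5)
print(kept_indices)  # [0, 1]
```

Because self-attention cost grows with sequence length, dropping tokens layer by layer shrinks the work done by every later layer. The weakness of this baseline is that attention scores in early layers may rank tokens poorly, which is the gap the ranking distillation described above is meant to close.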