Pre-trained language models achieve superior performance but are computationally expensive. Techniques such as pruning and knowledge distillation have been developed to reduce their size and latency. In this work, we propose GRAIN (Gradient-based Intra-attention pruning), a structured pruning method that performs task-specific pruning with knowledge distillation and yields highly effective models. Unlike common approaches that prune each attention head as a whole, GRAIN inspects and prunes intra-attention structures, which greatly expands the structure search space and enables more flexible models. We also propose a gradient separation strategy that reduces the interference of distillation with pruning, so that the two approaches combine better. Experiments on GLUE, SQuAD, and CoNLL 2003 show that GRAIN notably outperforms other methods, especially in the high-sparsity regime, and achieves $6\sim7\times$ speedups while maintaining $93\%\sim99\%$ performance. Under extreme compression where only $3\%$ of the transformer weights remain, the pruned model is still competitive with larger models.
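To make the contrast between whole-head pruning and intra-attention pruning concrete, the following is a minimal sketch in NumPy. It is not the GRAIN implementation: the abstract does not specify the scoring function or the exact prunable units, so the first-order score $|w \cdot g|$, the choice of per-head output dimensions as units, and all function names (`unit_importance`, `head_level_mask`, `intra_head_mask`) are illustrative assumptions. Random values stand in for real weights and task-loss gradients.

```python
import numpy as np

def unit_importance(weight, grad):
    """Assumed first-order importance of each output unit: sum_i |w_ij * g_ij|."""
    return np.abs(weight * grad).sum(axis=0)

def head_level_mask(scores, n_heads, head_dim, keep_units):
    """Coarse pruning: keep or drop each attention head as a whole."""
    head_scores = scores.reshape(n_heads, head_dim).sum(axis=1)
    keep_heads = keep_units // head_dim
    top_heads = np.argsort(head_scores)[::-1][:keep_heads]
    mask = np.zeros((n_heads, head_dim), dtype=bool)
    mask[top_heads] = True
    return mask

def intra_head_mask(scores, n_heads, head_dim, keep_units):
    """Fine-grained pruning: keep the top-scoring units regardless of head."""
    top_units = np.argsort(scores)[::-1][:keep_units]
    mask = np.zeros(n_heads * head_dim, dtype=bool)
    mask[top_units] = True
    return mask.reshape(n_heads, head_dim)

# Toy projection matrix; all sizes are hypothetical.
rng = np.random.default_rng(0)
hidden, n_heads, head_dim = 16, 4, 4
W = rng.normal(size=(hidden, n_heads * head_dim))
G = rng.normal(size=W.shape)  # stand-in for a task-loss gradient

scores = unit_importance(W, G)
keep = 8  # same unit budget for both strategies
coarse = head_level_mask(scores, n_heads, head_dim, keep)
fine = intra_head_mask(scores, n_heads, head_dim, keep)

print("units kept per head (head-level):", coarse.sum(axis=1))
print("units kept per head (intra-head):", fine.sum(axis=1))
print("retained importance (head-level):", scores[coarse.ravel()].sum())
print("retained importance (intra-head):", scores[fine.ravel()].sum())
```

Under the same budget, the intra-head mask can leave heads with different numbers of surviving units, and the importance it retains is never lower than that of the head-level mask, because whole-head selections are a subset of the configurations it searches over. This is one way to read the abstract's claim about an expanded structure search space.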