Attention head pruning has emerged as an effective technique for transformer model compression, an increasingly important goal in the era of Green AI. However, existing pruning methods often rely on static importance scores, which fail to capture the evolving role of attention heads during iterative removal. We propose Greedy-Gradient norm (Greedy-Gnorm), a novel head pruning algorithm that dynamically recalculates head importance after each pruning step. Specifically, each head is scored by the product of the L2-norms of its Q, K, and V gradient blocks, as estimated on a held-out validation set and updated at every greedy iteration. This dynamic scoring mitigates stale rankings and better reflects gradient-informed importance as pruning progresses. Extensive experiments on BERT, ALBERT, RoBERTa, and XLM-RoBERTa demonstrate that Greedy-Gnorm consistently preserves accuracy under substantial head removal, outperforming the attention-entropy baseline. By effectively reducing model size while maintaining task performance, Greedy-Gnorm offers a promising step toward more energy-efficient transformer model deployment.
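The greedy loop described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `estimate_gradients` is a hypothetical stand-in for a backward pass over the held-out validation set, and random matrices substitute for real Q/K/V gradient blocks.

```python
import numpy as np

rng = np.random.default_rng(0)
n_heads, d = 12, 16  # toy model: 12 heads, 16x16 gradient blocks


def head_score(grad_blocks):
    # Greedy-Gnorm score: product of the L2-norms of the
    # Q, K, and V gradient blocks of one attention head.
    return float(np.prod([np.linalg.norm(g) for g in grad_blocks]))


def estimate_gradients(active_heads):
    # Hypothetical placeholder: in the actual method, gradients are
    # estimated via a backward pass on a held-out validation set with
    # only the still-active heads enabled. Here: random stand-ins.
    return {h: [rng.standard_normal((d, d)) for _ in range(3)]
            for h in active_heads}


def greedy_gnorm_prune(n_heads, n_prune):
    active = set(range(n_heads))
    pruned = []
    for _ in range(n_prune):
        # Dynamic re-scoring: importance is recomputed from fresh
        # gradients at every greedy iteration, avoiding stale rankings.
        grads = estimate_gradients(active)
        scores = {h: head_score(grads[h]) for h in active}
        victim = min(scores, key=scores.get)  # least-important head
        active.remove(victim)
        pruned.append(victim)
    return pruned, sorted(active)


pruned, kept = greedy_gnorm_prune(n_heads, n_prune=4)
print("pruned:", pruned, "kept:", kept)
```

A static method would compute `scores` once and remove the bottom `n_prune` heads in a single pass; the distinguishing feature sketched here is that scoring happens inside the loop, after each removal.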