It is widely acknowledged that large, sparse models achieve higher accuracy than small, dense models under the same model size constraint. This motivates training a large model and then pruning its redundant neurons or weights. Most existing works prune networks deterministically, so their performance depends solely on a single pruning criterion and lacks variety. In this paper, we instead propose a model pruning strategy that first generates several pruning masks through a designed randomization scheme. An effective mask-selection rule then chooses the best mask from this pool of candidates. To further improve efficiency, we introduce an early mask evaluation strategy that mitigates the overhead of training multiple masks. Extensive experiments demonstrate that this approach achieves state-of-the-art performance across eight datasets from GLUE, and that it excels in particular at high levels of sparsity.
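The generate-then-select idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the randomization scheme, the selection rule, or the early evaluation procedure, so everything concrete here is an assumption made for illustration: masks are sampled by adding Gumbel noise to log-magnitude scores, the selection rule picks the candidate with the lowest loss, and "early evaluation" is approximated by scoring each pruned model on a small batch without any fine-tuning. The names `sample_masks`, `select_mask`, `early_loss`, and `temperature` are hypothetical and do not come from the paper.

```python
import numpy as np


def sample_masks(weights, sparsity, num_masks, temperature, rng):
    """Sample candidate keep-masks by perturbing magnitude scores with noise.

    Assumption: randomness is injected as Gumbel noise on log-magnitudes, so
    large weights are likely, but not guaranteed, to be kept.
    """
    flat = np.abs(weights).ravel()
    n_keep = int(round(flat.size * (1.0 - sparsity)))
    masks = []
    for _ in range(num_masks):
        noise = rng.gumbel(size=flat.shape) * temperature
        scores = np.log(flat + 1e-12) + noise
        keep = np.argsort(-scores)[:n_keep]  # indices of the top-scoring weights
        mask = np.zeros(flat.shape, dtype=bool)
        mask[keep] = True
        masks.append(mask.reshape(weights.shape))
    return masks


def select_mask(weights, masks, early_loss):
    """Pick the candidate whose pruned weights give the lowest early loss.

    Assumption: the selection rule is a plain argmin over a cheap loss estimate.
    """
    losses = [early_loss(weights * m) for m in masks]
    best = int(np.argmin(losses))
    return masks[best], losses


# Toy setup: a linear model standing in for a trained dense network.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 20))
true_w = np.zeros(20)
true_w[:5] = 3.0 * rng.normal(size=5)
y = X @ true_w
dense_w = true_w + 0.1 * rng.normal(size=20)  # "trained" dense weights


def early_loss(w):
    # Cheap proxy: mean squared error on a small batch, no fine-tuning.
    return float(np.mean((X @ w - y) ** 2))


masks = sample_masks(dense_w, sparsity=0.75, num_masks=8, temperature=0.5, rng=rng)
best_mask, losses = select_mask(dense_w, masks, early_loss)
print("kept weights per mask:", [int(m.sum()) for m in masks])
print("selected-mask loss:", early_loss(dense_w * best_mask))
```

In this sketch, setting `temperature` to zero collapses every candidate to ordinary deterministic magnitude pruning, while larger values produce a more varied pool at the cost of more low-quality candidates for the selection step to filter out.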