N:M structured sparsity has garnered significant interest because of its relatively modest overhead and improved efficiency. It is also appealing for reducing the memory footprint, owing to its modest representation overhead. Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions ($\sim$50\%), and the performance of models trained with these approaches tends to decline in high-sparsity regions ($>$80\%). In this work, we study the effectiveness of existing sparse training recipes in \textit{high-sparsity regions} and argue that these methods fail to sustain model quality on par with low-sparsity regions. We demonstrate that the key factor contributing to this disparity is the elevated level of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms that progressively restrict the flow of gradients towards pruned elements. Our approach improves model quality by up to 2$\%$ and 5$\%$ in vision and language models, respectively, in the high-sparsity regime. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method outperforms conventional sparse training recipes, with an accuracy improvement of up to 2$\%$. The source code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity.
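To make the idea of progressively restricting gradient flow concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes the usual N:M convention (keep the N largest-magnitude weights in every group of M consecutive weights) and a linear decay schedule. The function names `nm_mask`, `decay_factor`, and `restrict_gradients`, the linear schedule, and the choice to scale pruned-element gradients by a single factor are all illustrative assumptions; the abstract does not specify which decay mechanisms are used.

```python
def nm_mask(weights, n, m):
    """Return a 0/1 mask keeping the n largest-magnitude weights in each group of m."""
    if len(weights) % m != 0:
        raise ValueError("length must be a multiple of m")
    mask = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Indices of the n largest-magnitude entries in this group.
        keep = sorted(range(m), key=lambda i: abs(group[i]), reverse=True)[:n]
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


def decay_factor(step, total_steps):
    """Hypothetical linear schedule (an assumption): 1.0 at step 0, 0.0 from total_steps on."""
    return max(0.0, 1.0 - step / total_steps)


def restrict_gradients(grads, mask, beta):
    """Scale gradients of pruned positions (mask == 0) by beta; kept positions pass unchanged."""
    return [g if keep else beta * g for g, keep in zip(grads, mask)]


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.8, 0.01]
grads = [1.0] * len(weights)
mask = nm_mask(weights, n=1, m=4)  # 1:4 sparsity, i.e. 75% of weights pruned
for step in (0, 50, 100):
    beta = decay_factor(step, total_steps=100)
    print(step, restrict_gradients(grads, mask, beta))
```

In this sketch, pruned elements receive their full gradient at the start of training and none by the end, so the transition from dense to sparse updates is gradual rather than abrupt.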