The Emergence of Essential Sparsity in Large Pre-trained Models: The Weights that Matter

Large pre-trained transformers are the show-stealers of modern deep learning, and as they grow in scale it becomes crucial to understand the parsimonious patterns that exist within them. With exploding parameter counts, the Lottery Ticket Hypothesis (LTH) and its variants have lost their practicality for sparsifying these models: the repetitive train-prune-retrain routine of iterative magnitude pruning (IMP) imposes high computation and memory costs, which worsen as model size increases. This paper comprehensively studies induced sparse patterns across multiple large pre-trained vision and language transformers. We propose the existence of essential sparsity, defined by a sharp dropping point beyond which performance declines much faster as the sparsity level rises, when we directly remove the weights with the smallest magnitudes in one shot without re-training. We find that essential sparsity also holds for N:M sparsity patterns and for modern-scale large language models (Vicuna-7B). We further present an intriguing emergent phenomenon of abrupt sparsification during the pre-training of BERT, i.e., BERT suddenly becomes heavily sparse after a certain number of pre-training iterations. Moreover, our observations indicate a counter-intuitive finding: BERT trained with a larger amount of pre-training data tends to be better at condensing knowledge into comparatively fewer parameters. Lastly, we investigate the effect of the pre-training loss on essential sparsity and discover that self-supervised learning (SSL) objectives trigger stronger emergent sparsification properties than supervised learning (SL). Our code is available at \url{https://github.com/VITA-Group/essential_sparsity}.
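
To make the pruning procedure concrete, the following is a minimal sketch of one-shot magnitude pruning and of an N:M mask, written with NumPy on a random matrix. It is not taken from the authors' repository: the function names `one_shot_magnitude_mask` and `n_m_mask`, the toy weight matrix, and the sparsity levels are all illustrative assumptions. In the setting described above, each mask would be applied to a pre-trained model and downstream performance evaluated with no re-training.

```python
import numpy as np


def one_shot_magnitude_mask(weights, sparsity):
    """Boolean mask that drops the `sparsity` fraction of smallest-|w| entries.

    Hypothetical helper for illustration; `sparsity` is a float in [0, 1].
    """
    flat = np.abs(weights).ravel()
    k = int(round(sparsity * flat.size))  # number of weights to remove
    mask = np.ones(flat.size, dtype=bool)
    if k > 0:
        # argpartition places the indices of the k smallest magnitudes first
        drop = np.argpartition(flat, k - 1)[:k]
        mask[drop] = False
    return mask.reshape(weights.shape)


def n_m_mask(weights, n=2, m=4):
    """Boolean mask keeping the n largest-|w| entries in each group of m.

    Hypothetical helper; groups are consecutive entries in row-major order.
    """
    assert weights.size % m == 0, "number of weights must be divisible by m"
    groups = np.abs(weights).reshape(-1, m)
    # indices of the (m - n) smallest magnitudes within each group
    drop = np.argsort(groups, axis=1)[:, : m - n]
    mask = np.ones_like(groups, dtype=bool)
    np.put_along_axis(mask, drop, False, axis=1)
    return mask.reshape(weights.shape)


# Toy stand-in for a pre-trained weight matrix (assumption, not real model weights).
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 16))

for s in (0.1, 0.3, 0.5, 0.7):
    pruned = w * one_shot_magnitude_mask(w, s)
    # A real study would evaluate the pruned model here, without re-training,
    # and look for the sparsity level at which performance starts to drop sharply.
    print(f"target sparsity {s:.1f}: {np.mean(pruned == 0):.3f} of weights are zero")

pruned_2_4 = w * n_m_mask(w, n=2, m=4)
```

The unstructured mask ranks all weights of a tensor together, whereas the N:M mask ranks weights only within each small group, which is the more hardware-friendly pattern.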
