STADE: Standard Deviation as a Pruning Metric

Large Language Models (LLMs) are now widely used to solve a broad range of tasks. Handling these tasks well requires longer training times and larger model sizes, which makes LLMs ideal candidates for pruning methods that reduce computational demands while maintaining performance. Earlier methods require a retraining phase after pruning to recover the original model's performance. State-of-the-art pruning methods such as Wanda, by contrast, prune the model without retraining, making the pruning process faster and more efficient. Building on Wanda, this study provides a theoretical explanation of why the method is effective and uses these insights to improve the pruning process. Specifically, a theoretical analysis of the pruning problem reveals a common scenario in Machine Learning in which Wanda is the optimal pruning method. The analysis is then extended to cases where Wanda is no longer optimal, leading to a new method, STADE, based on the standard deviation of the input. From a theoretical standpoint, STADE generalizes better across different scenarios. Finally, extensive experiments on Llama and Open Pre-trained Transformers (OPT) models validate these theoretical findings, showing that whether Wanda is optimal depends on the training conditions, as the theoretical framework predicts. These insights contribute to a more robust understanding of pruning strategies and their practical implications. Code is available at: https://github.com/Coello-dev/STADE/
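
To make the contrast between the two metrics concrete, the following is a minimal NumPy sketch, not the authors' implementation. The Wanda score is the published one (weight magnitude times the L2 norm of the corresponding input feature). The standard-deviation score is an assumption about how a metric "based on the standard deviation of the input" might look, since the abstract does not give the formula. The function names, the toy shapes, and the per-output-row pruning with 50% sparsity are all illustrative choices.

```python
import numpy as np


def wanda_scores(weight, inputs):
    """Wanda metric: |W_ij| * ||X_j||_2 over the calibration tokens."""
    feature_norm = np.linalg.norm(inputs, axis=0)  # shape: (in_features,)
    return np.abs(weight) * feature_norm


def std_scores(weight, inputs):
    """ASSUMED std-based metric: |W_ij| * std(X_j). Not taken from the paper."""
    feature_std = inputs.std(axis=0)  # shape: (in_features,)
    return np.abs(weight) * feature_std


def prune_rows(weight, scores, sparsity=0.5):
    """Zero the lowest-scoring fraction of weights in each output row."""
    n_prune = int(weight.shape[1] * sparsity)
    # Column indices of the n_prune smallest scores in every row.
    drop = np.argsort(scores, axis=1)[:, :n_prune]
    pruned = weight.copy()
    np.put_along_axis(pruned, drop, 0.0, axis=1)
    return pruned


# Toy layer: 2 outputs, 4 inputs, 256 calibration tokens (hypothetical data).
rng = np.random.default_rng(0)
inputs = rng.normal(size=(256, 4))
# Feature 0 has a large mean but is almost constant across tokens.
inputs[:, 0] = 10.0 + 0.01 * rng.normal(size=256)
weight = np.ones((2, 4))

wanda_pruned = prune_rows(weight, wanda_scores(weight, inputs))
std_pruned = prune_rows(weight, std_scores(weight, inputs))
```

In this toy setup the two metrics disagree on feature 0: its large norm makes the Wanda score keep the weights attached to it, while its small spread makes the assumed standard-deviation score prune them. Which choice is preferable in a given setting is the question the paper's analysis addresses; the sketch only shows that the rankings can differ.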
