The transformative impact of large language models (LLMs) such as LLaMA and GPT on natural language processing comes at the cost of prohibitive computational demands. Pruning has emerged as a key compression strategy that introduces sparsity to improve both memory and computational efficiency. However, traditional global pruning is impractical for LLMs because it does not scale to models of this size, while local pruning, though efficient, yields suboptimal solutions. To address these challenges, we propose Adaptive Global Pruning (AdaGP), a novel framework that reformulates global pruning as a set of manageable, coordinated subproblems, enabling resource-efficient optimization without sacrificing global optimality. AdaGP conceptualizes an LLM as a chain of modular functions and leverages auxiliary variables to decompose the pruning problem. This approach not only makes global pruning practical for LLMs but also delivers significant performance improvements, particularly in high-sparsity regimes, where it surpasses current state-of-the-art methods.
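To make the idea of decomposing global pruning with auxiliary variables concrete, the block below gives one generic way to write such a decomposition. It is an illustrative penalty formulation in the spirit of the method of auxiliary coordinates, not necessarily AdaGP's exact objective. The notation is assumed for this sketch only: a pretrained model $f = f_L \circ \cdots \circ f_1$ with dense weights $W_\ell$, calibration inputs $X$, dense-model outputs $Y$, binary masks $M_\ell$, per-module budgets $k_\ell$ (maximum number of retained weights), auxiliary variables $Z_\ell$ with $Z_0 = X$, and penalty weights $\alpha_\ell > 0$.

```latex
% Global pruning: every mask is coupled through the full composition
\min_{\{M_\ell, W_\ell\}} \;
\bigl\| f_L\bigl(\cdots f_1(X;\, M_1 \odot W_1) \cdots;\, M_L \odot W_L\bigr) - Y \bigr\|_F^2
\quad \text{s.t. } \|M_\ell\|_0 \le k_\ell,\; \ell = 1,\dots,L

% Decomposed form: auxiliary variables Z_\ell stand in for intermediate outputs
% (Z_0 = X; the same sparsity constraints on each M_\ell apply)
\min_{\{M_\ell, W_\ell\},\, \{Z_\ell\}} \;
\sum_{\ell=1}^{L-1} \alpha_\ell \bigl\| Z_\ell - f_\ell(Z_{\ell-1};\, M_\ell \odot W_\ell) \bigr\|_F^2
+ \bigl\| f_L(Z_{L-1};\, M_L \odot W_L) - Y \bigr\|_F^2
```

With the $Z_\ell$ held fixed, the decomposed objective splits into $L$ independent per-module subproblems, each involving only its own $M_\ell$ and $W_\ell$. With the weights held fixed, each $Z_\ell$ appears in only two adjacent terms. Alternating between these two kinds of update keeps every step small while still coordinating the modules. Freezing every $Z_\ell$ at the dense model's activations and never updating them reduces the problem to purely local, layer-wise reconstruction.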