The transformative impact of large language models (LLMs) like LLaMA and GPT on natural language processing is countered by their prohibitive computational demands. Pruning has emerged as a pivotal compression strategy, introducing sparsity to enhance both memory and computational efficiency. Yet traditional global pruning is impractical for LLMs due to scalability issues, while local pruning, despite its efficiency, leads to suboptimal solutions. Addressing these challenges, we propose SparseLLM, a novel framework that redefines the global pruning process as a set of manageable, coordinated subproblems, enabling resource-efficient optimization while retaining global optimality. By conceptualizing LLMs as a chain of modular functions and leveraging auxiliary variables to decompose the problem, SparseLLM makes global pruning practical for LLMs and delivers significant performance improvements, particularly in high-sparsity regimes, where it surpasses current state-of-the-art methods.
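The decomposition idea can be illustrated with a heavily simplified toy sketch: a two-layer *linear* chain is pruned by alternating between per-layer sparse least-squares subproblems that are coupled through an auxiliary variable `Z` standing in for the intermediate activations. This is only an assumption-laden caricature of the approach described above (the actual method operates on transformer blocks, handles nonlinearities, and uses the paper's own subproblem solvers); all names and the magnitude-based support selection here are illustrative choices, not the authors' algorithm.

```python
import numpy as np

def prune_and_refit(W, A, T, sparsity):
    """Toy sparse subproblem: pick a support for W by weight magnitude,
    then refit the kept weights row by row to minimize ||W' A - T||_F."""
    k = int(W.size * sparsity)
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    mask = np.abs(W) > thresh
    Wp = np.zeros_like(W)
    for i in range(W.shape[0]):
        idx = np.flatnonzero(mask[i])
        if idx.size:  # least-squares refit restricted to the kept entries
            sol, *_ = np.linalg.lstsq(A[idx].T, T[i], rcond=None)
            Wp[i, idx] = sol
    return Wp

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 128))   # calibration inputs (features x samples)
W1 = rng.normal(size=(64, 32))   # dense layer 1
W2 = rng.normal(size=(16, 64))   # dense layer 2
Y = W2 @ (W1 @ X)                # dense chain output the sparse chain should match

# Auxiliary variable Z decouples the two layers into coordinated subproblems.
Z = W1 @ X                       # initialize from the dense forward pass
for _ in range(5):
    W1p = prune_and_refit(W1, X, Z, 0.7)  # subproblem 1: W1' X ~ Z
    W2p = prune_and_refit(W2, Z, Y, 0.7)  # subproblem 2: W2' Z ~ Y
    # Closed-form coordination step: argmin_Z ||W1' X - Z||^2 + ||W2' Z - Y||^2
    lhs = np.eye(W2p.shape[1]) + W2p.T @ W2p
    Z = np.linalg.solve(lhs, W1p @ X + W2p.T @ Y)
```

Because each subproblem only ever sees one layer's weights plus the shared auxiliary variable, the memory footprint stays local, while the coordination step keeps the layers optimizing toward a common global objective.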