Large Language Models (LLMs) have transformed the landscape of artificial intelligence, but their enormous size presents significant challenges in terms of computational cost. We introduce LoRAShear, a novel and efficient approach to structurally prune LLMs and recover knowledge. Given general LLMs, LoRAShear first creates dependency graphs over LoRA modules to discover minimally removable structures and analyze the knowledge distribution. It then performs progressive structured pruning on LoRA adaptors and enables inherent knowledge transfer to better preserve the information in the redundant structures. To recover the knowledge lost during pruning, LoRAShear studies and proposes a dynamic fine-tuning scheme with dynamic data adaptors to effectively narrow the performance gap to the full models. Numerical results demonstrate that, using only one GPU within a couple of GPU days, LoRAShear reduces the footprint of LLMs by 20% with only 1.0% performance degradation and significantly outperforms state-of-the-art methods. The source code will be available at https://github.com/microsoft/lorashear.