The increasing size and complexity of Large Language Models (LLMs) make them difficult to deploy on personal computers and mobile devices. Aggressive post-training compression is necessary to reduce model size, but it often incurs significant accuracy loss. To address this challenge, we propose a novel network pruning technique that combines sparsity above 0.7 with quantization below 8 bits. Our approach compresses prevailing LLMs within a couple of hours while incurring only a relatively small accuracy loss. Experimental evaluations demonstrate the effectiveness of our method and its potential for practical deployment. By making LLMs usable on consumer devices, our work can facilitate a new era of natural language processing applications with wide-ranging impact.
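To make the compression regime concrete, the sketch below illustrates the two operations the abstract names, magnitude pruning to 0.7 sparsity followed by sub-8-bit uniform quantization, applied to a single weight matrix. This is a minimal illustrative sketch, not the paper's actual algorithm; all function names and design choices here are assumptions.

```python
# Illustrative sketch only: magnitude pruning + uniform quantization.
# This is NOT the paper's method; names and choices are assumptions.
import numpy as np

def prune_by_magnitude(w: np.ndarray, sparsity: float = 0.7) -> np.ndarray:
    """Zero out the smallest-magnitude weights until `sparsity` is reached."""
    k = int(sparsity * w.size)
    if k == 0:
        return w.copy()
    # k-th smallest absolute value over the flattened matrix is the cutoff.
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    pruned = w.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

def quantize_uniform(w: np.ndarray, bits: int = 8) -> np.ndarray:
    """Symmetric uniform quantization of weights to `bits` bits."""
    qmax = 2 ** (bits - 1) - 1              # e.g. 127 for 8 bits
    max_abs = np.abs(w).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.round(w / scale).clip(-qmax, qmax)
    return q * scale                        # dequantized values for evaluation

# Toy usage: compress one weight matrix at 0.7 sparsity, 8-bit precision.
rng = np.random.default_rng(0)
w = rng.normal(size=(512, 512)).astype(np.float32)
w_compressed = quantize_uniform(prune_by_magnitude(w, 0.7), 8)
print(f"sparsity: {(w_compressed == 0).mean():.2f}")
```

Post-training methods of this kind require no retraining, which is consistent with the abstract's claim of compressing a prevailing LLM within a couple of hours; the accuracy impact of such aggressive settings is what the paper's evaluation measures.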