While post-training compression techniques effectively reduce the memory footprint, latency, and power consumption of Large Language Models (LLMs), they often cause noticeable accuracy degradation and remain limited by hardware and kernel constraints that restrict the supported compression formats, ultimately reducing flexibility across a wide range of deployment scenarios. In this work, we propose EoRA, a novel fine-tuning-free method that augments compressed LLMs with low-rank matrices, allowing users to rapidly enhance task-specific performance and freely balance the trade-off between accuracy and computational overhead beyond the constraints of compression formats. EoRA consistently outperforms prior training-free low-rank methods in recovering the accuracy of compressed LLMs, achieving notable accuracy improvements (e.g., $\mathbf{10.84\%}$ on ARC-Challenge, $\mathbf{6.74\%}$ on MathQA, and $\mathbf{11.45\%}$ on GSM8K) for LLaMA3-8B compressed to 3-bit. We also introduce an optimized CUDA kernel that accelerates inference by up to 1.4x, and we reduce memory overhead by quantizing EoRA. Overall, EoRA offers a ready-to-use solution for improving the accuracy of compressed models under varying user requirements, enabling more efficient and flexible deployment of LLMs. Code is available at https://github.com/NVlabs/EoRA.
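To make the core idea concrete, the following is a minimal sketch of compensating a compressed weight matrix with a training-free low-rank correction. It uses a plain SVD of the compression residual rather than EoRA's actual eigenspace projection, and the `fake_quantize` helper is a hypothetical stand-in for a real post-training compression method; only the calling pattern, $y = W_c x + B(Ax)$, mirrors how such low-rank augmentation attaches to a compressed layer.

```python
# Illustrative sketch only -- NOT the EoRA algorithm. EoRA projects the
# compression error into an eigenspace; here we use a plain truncated SVD
# of the residual to show the general low-rank compensation pattern.
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(w: np.ndarray, n_bits: int = 3) -> np.ndarray:
    """Symmetric round-to-nearest quantization; a hypothetical stand-in
    for any post-training compression method."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

def lowrank_compensation(w: np.ndarray, w_c: np.ndarray, rank: int):
    """Rank-r factors (B, A) best approximating the residual W - W_c."""
    u, s, vt = np.linalg.svd(w - w_c, full_matrices=False)
    b = u[:, :rank] * s[:rank]   # shape (out_dim, rank)
    a = vt[:rank, :]             # shape (rank, in_dim)
    return b, a

w = rng.standard_normal((64, 64))
w_c = fake_quantize(w, n_bits=3)
b, a = lowrank_compensation(w, w_c, rank=8)

x = rng.standard_normal(64)
y_full = w @ x                       # uncompressed output
y_comp = w_c @ x                     # compressed only
y_lr = w_c @ x + b @ (a @ x)         # compressed + low-rank correction

# The low-rank path shrinks the output error relative to compression alone.
print(np.linalg.norm(y_full - y_lr) < np.linalg.norm(y_full - y_comp))
```

The rank `r` is the knob the abstract alludes to: a larger rank recovers more accuracy at the cost of extra parameters and compute in the side path, and because the correction lives outside the compressed format, this trade-off can be adjusted without touching the kernel or the quantization scheme.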