Large language models (LLMs) demand ever more GPU memory, which calls for weight compression methods that are both efficient and extreme. Existing compression methods are either theoretically limited to 1 bit per weight or suffer from severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, which compresses LLMs using data sketches. It reduces the peak GPU memory footprint by achieving compression rates as low as 0.5 bits per weight. Combined with a hardware-friendly implementation, UltraSketchLLM keeps performance degradation tolerable and latency overhead extremely low, achieving a 14.9x speedup over a naive sketch implementation.
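To put the bits-per-weight figures in concrete terms, the following minimal sketch converts a compression rate into the memory needed to store the weights alone. The 7-billion-parameter model size and the helper name `weight_memory_gib` are illustrative assumptions, not values or code from this work, and the calculation ignores activations, the KV cache, and any metadata a sketch structure may need.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    total_bits = num_params * bits_per_weight
    return total_bits / 8 / 2**30  # bits -> bytes -> GiB


# Hypothetical model size chosen for illustration only.
NUM_PARAMS = 7_000_000_000

if __name__ == "__main__":
    # 16 bits is a common uncompressed baseline; 1 bit is the limit cited
    # for existing methods; 0.5 bits is the rate reported in the abstract.
    for bits in (16, 1, 0.5):
        gib = weight_memory_gib(NUM_PARAMS, bits)
        print(f"{bits:>4} bits/weight -> {gib:.2f} GiB")
```

Under these assumptions, going from 16 bits to 0.5 bits per weight shrinks weight storage by a factor of 32, and going below the 1-bit limit halves it once more.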