The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. However, the increasing scale of these models introduces substantial challenges for storage, transmission, and deployment. Although considerable effort has been devoted to model compression and quantization, existing methods often rely on fine-tuning or calibration data and generalize poorly across different tensor types. In this paper, we argue that video codecs offer a promising solution for LLM compression because of their inherent compatibility with matrix-structured data, their configurable compression strategies, and the availability of highly optimized, off-the-shelf implementations. We therefore present LLMCodec, a video-codec-based LLM compression method that integrates affine quantization with the recent VVC/H.266 video codec. Beyond VVC, we compare a range of video codecs and encoding profiles to evaluate their impact on compression performance. Experiments on different models demonstrate the robustness and generality of LLMCodec. Notably, on LLaMA-3-8B at 2-bit precision, LLMCodec reduces perplexity by a factor of more than 1.5 and improves downstream task accuracy by 21% compared with the existing method.
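The affine quantization step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the per-tensor min/max scheme, the 8-bit sample depth, and the names `affine_quantize` and `affine_dequantize` are assumptions made for illustration, and the video encoding step (VVC/H.266 in the paper) is omitted entirely. The sketch only shows how a floating-point weight matrix might be mapped to an integer "frame" that a codec could take as input, and how it would be mapped back.

```python
import numpy as np


def affine_quantize(w, num_bits=8):
    """Map a float matrix onto integers in [0, 2**num_bits - 1].

    Assumption: one scale and offset per tensor, derived from its min and max.
    """
    assert 1 <= num_bits <= 8, "this sketch stores samples as uint8"
    qmax = 2 ** num_bits - 1
    w_min, w_max = float(w.min()), float(w.max())
    # Guard against a constant tensor, where the value range would be zero.
    scale = (w_max - w_min) / qmax if w_max > w_min else 1.0
    q = np.clip(np.round((w - w_min) / scale), 0, qmax).astype(np.uint8)
    return q, scale, w_min


def affine_dequantize(q, scale, w_min):
    """Invert the affine map; the result differs from the input by rounding error."""
    return q.astype(np.float32) * scale + w_min


# Hypothetical stand-in for one weight matrix of a model.
rng = np.random.default_rng(0)
weights = rng.normal(size=(64, 64)).astype(np.float32)

# 'frame' is a single-channel 8-bit image that a video encoder could consume.
frame, scale, offset = affine_quantize(weights)
restored = affine_dequantize(frame, scale, offset)
max_error = float(np.abs(weights - restored).max())
```

In this sketch the reconstruction error of the quantization step alone is bounded by roughly half the scale; any lossy codec applied to `frame` afterwards would add its own distortion on top of that.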