The rapid development of large language models (LLMs) has led to remarkable advances in natural language processing. However, the increasing scale of these models introduces substantial challenges for storage, transmission, and deployment. Although considerable effort has been devoted to model compression and quantization, existing methods often rely on fine-tuning or calibration data and generalize poorly across different tensor types. In this paper, we argue that video codecs offer a promising solution for LLM compression, owing to their inherent compatibility with matrix-structured data, their configurable compression strategies, and the availability of highly optimized, off-the-shelf implementations. We therefore present LLMCodec, a video codec-based LLM compression method that integrates affine quantization with the recent VVC/H.266 video codec. Beyond VVC, we compare a range of video codecs and encoding profiles to evaluate their impact on compression performance. Experiments on different models demonstrate the robustness and generality of LLMCodec. Notably, on LLaMA-3-8B at 2-bit precision, LLMCodec reduces perplexity by a factor of more than 1.5 and improves downstream task accuracy by 21% compared with the existing method.
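The abstract does not detail the quantization step, so the following is only a minimal sketch of generic per-tensor affine (min-max) quantization, the kind of mapping that turns a weight matrix into integer samples a video codec could treat as a grayscale frame. The function names, the 8-bit sample depth, and the per-tensor granularity are all assumptions made for illustration, not LLMCodec's actual design, and the codec stage itself is omitted.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Frame = List[List[int]]


def affine_quantize(weights: Matrix, bits: int = 8) -> Tuple[Frame, float, float]:
    """Map floats to integers in [0, 2**bits - 1] via q = round((w - w_min) / scale).

    Hypothetical helper: per-tensor min-max affine quantization.
    """
    levels = (1 << bits) - 1
    flat = [w for row in weights for w in row]
    w_min, w_max = min(flat), max(flat)
    # Guard against a constant tensor, where the value range would be zero.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    frame = [
        [min(levels, max(0, round((w - w_min) / scale))) for w in row]
        for row in weights
    ]
    return frame, scale, w_min


def affine_dequantize(frame: Frame, scale: float, w_min: float) -> Matrix:
    """Invert the affine map: w ~= q * scale + w_min."""
    return [[q * scale + w_min for q in row] for row in frame]


if __name__ == "__main__":
    weights = [[-0.5, 0.0], [0.25, 0.5]]
    frame, scale, w_min = affine_quantize(weights)
    restored = affine_dequantize(frame, scale, w_min)
    print(frame)
    print(restored)
```

In a codec-based pipeline, the integer matrix together with `scale` and `w_min` would be what gets stored: the matrix is passed to the encoder as image data, and the two scalars are kept as side information for dequantization after decoding. Before any codec loss is introduced, the round-trip error of this mapping is bounded by half a quantization step.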