As the parameter counts of large language models (LLMs) continue to grow, their large memory footprints and high communication-bandwidth demands have become significant bottlenecks for both training and inference. To mitigate these bottlenecks, various tensor compression techniques have been proposed to reduce data size, thereby alleviating memory requirements and communication pressure. Our research finds that video codecs, despite being originally designed for compressing videos, show excellent efficiency when compressing various types of tensors. We demonstrate that video codecs can serve as versatile, general-purpose tensor codecs while achieving state-of-the-art compression efficiency across a range of tasks. We further exploit the hardware video encoding and decoding units available on GPUs to build a framework that supports both inference and training with video codecs repurposed as tensor codecs. This greatly reduces the required memory capacity and communication bandwidth, enabling training and inference of large models on consumer-grade GPUs.
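To make the core idea concrete, the sketch below quantizes a floating-point tensor to 8-bit values and tiles it into 2-D "frames", the layout a video encoder consumes. All function names, the frame shape, and the min-max quantization scheme are illustrative assumptions, not the paper's actual pipeline; `zlib` stands in for a real hardware video codec (e.g., NVENC/NVDEC on the GPU), which would additionally exploit spatial redundancy across each frame.

```python
import zlib
import numpy as np

def tensor_to_frames(t, frame_shape=(64, 64)):
    """Min-max quantize a float tensor to uint8 and tile it into 2-D frames.

    Illustrative assumption: real systems may use different quantization
    and frame layouts tuned to the codec's motion/intra prediction.
    """
    lo, hi = float(t.min()), float(t.max())
    scale = (hi - lo) or 1.0
    q = np.round((t - lo) / scale * 255.0).astype(np.uint8)
    pad = (-q.size) % (frame_shape[0] * frame_shape[1])   # pad to whole frames
    q = np.pad(q.ravel(), (0, pad))
    frames = q.reshape(-1, *frame_shape)
    return frames, (lo, scale, t.shape, pad)

def frames_to_tensor(frames, meta):
    """Invert tensor_to_frames: drop padding, dequantize, restore shape."""
    lo, scale, shape, pad = meta
    flat = frames.ravel().astype(np.float32)
    if pad:
        flat = flat[:-pad]
    return (flat / 255.0 * scale + lo).reshape(shape)

# zlib is a lossless stand-in for the lossy hardware video encoder/decoder.
w = np.random.default_rng(0).standard_normal((256, 256)).astype(np.float32)
frames, meta = tensor_to_frames(w)
blob = zlib.compress(frames.tobytes())

decoded = np.frombuffer(zlib.decompress(blob), np.uint8).reshape(frames.shape)
w_hat = frames_to_tensor(decoded, meta)

# With 8-bit min-max quantization, per-element error is bounded by half a
# quantization step (scale / 255 / 2).
assert np.max(np.abs(w - w_hat)) <= meta[1] / 255.0 / 2 + 1e-6
```

In this toy version all loss comes from quantization; with a real video codec the rate-distortion trade-off is instead controlled by the encoder's quality settings, and the frames can be streamed through the GPU's dedicated encode/decode engines without occupying the compute SMs.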