With the growth of model sizes and the scale of their deployment, the sheer size of models burdens the infrastructure, requiring more network bandwidth and more storage to accommodate them. While there is a vast model-compression literature on deleting parts of the model weights for faster inference, we investigate a more traditional type of compression - one that represents the model in a compact form, coupled with a decompression algorithm that returns it to its original form and size - namely lossless compression. We present ZipNN, a lossless compression method tailored to neural networks. Somewhat surprisingly, we show that specialized lossless compression can achieve significant network and storage reductions on popular models, often saving 33% and at times reducing model size by over 50%. We investigate the sources of model compressibility and introduce specialized compression variants tailored to models that further increase the effectiveness of compression. On popular models (e.g., Llama 3), ZipNN shows space savings that are over 17% better than vanilla compression, while also improving compression and decompression speeds by 62%. We estimate that these methods could save over an ExaByte per month of network traffic downloaded from a large model hub like Hugging Face.
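The abstract does not spell out where the compressibility of model weights comes from. A minimal sketch of one plausible mechanism (an assumption, not ZipNN's actual implementation): the high-order bytes of floating-point weights, which hold the sign and exponent, are far more skewed than the mantissa bytes, so grouping the byte streams before applying a general-purpose compressor can beat compressing the interleaved bytes. The sketch below uses Python's standard-library `zlib` on synthetic 2-byte (BF16-like) weights.

```python
# Hypothetical illustration: byte grouping of floating-point weights
# before lossless compression. Not ZipNN's code - just the general idea.
import random
import struct
import zlib

random.seed(0)

def to_bf16_bytes(x: float) -> bytes:
    # bfloat16 keeps the top 16 bits of a float32; in big-endian order
    # byte 0 holds the sign + most exponent bits, byte 1 mantissa bits.
    return struct.pack(">f", x)[:2]

# Simulate trained weights: small Gaussian values, so exponents cluster
# around a few magnitudes while mantissas look close to random.
weights = [random.gauss(0.0, 0.02) for _ in range(100_000)]
raw = b"".join(to_bf16_bytes(w) for w in weights)

# Baseline: compress the interleaved byte stream as-is.
baseline = len(zlib.compress(raw, 9))

# Byte grouping: split exponent bytes and mantissa bytes into separate
# streams and compress each on its own.
hi = raw[0::2]   # sign + exponent bytes (low entropy, highly skewed)
lo = raw[1::2]   # mantissa bytes (high entropy, nearly incompressible)
grouped = len(zlib.compress(hi, 9)) + len(zlib.compress(lo, 9))

print(f"original: {len(raw)} bytes")
print(f"interleaved zlib: {baseline} bytes")
print(f"byte-grouped zlib: {grouped} bytes")
```

On this synthetic data the grouped streams compress noticeably better than the interleaved one, because the compressor's entropy coding is no longer diluted by the near-random mantissa bytes sitting between the skewed exponent bytes.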