Large Language Models (LLMs) are typically trained in two phases: pre-training on large internet-scale datasets, and fine-tuning for downstream tasks. Given the higher computational demand of pre-training, it is intuitive to assume that fine-tuning adds less new information to the model, and is thus more compressible. We explore this assumption by decomposing the weights of fine-tuned models into their pre-trained components and an additional delta. We introduce a simple method, BitDelta, which successfully quantizes this delta down to 1 bit without compromising performance. This finding not only highlights the potential redundancy of information added during fine-tuning, but also has significant implications for the multi-tenant serving and storage of fine-tuned models. By enabling the use of a single high-precision base model accompanied by multiple 1-bit deltas, BitDelta reduces GPU memory requirements by more than 10x, which also translates into improved generation latency in multi-tenant settings. We validate BitDelta through experiments across the Llama-2 and Mistral model families, and on models up to 70B parameters, showcasing minimal performance degradation over all tested settings.
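To make the decomposition concrete, the following is a minimal NumPy sketch of splitting a fine-tuned weight matrix into its base weights plus a 1-bit delta. The abstract does not spell out the quantizer, so the scheme here (one sign bit per weight and a single scale per matrix, set to the mean absolute delta) is an assumption, and the function names `compress_delta` and `reconstruct` are hypothetical.

```python
import numpy as np


def compress_delta(w_base, w_fine):
    """Approximate (w_fine - w_base) with 1 bit per weight plus one scalar.

    Assumed scheme, not confirmed by the abstract: keep sign(delta) and a
    single per-matrix scale.
    """
    delta = w_fine - w_base
    # For fixed signs, the mean absolute value is the scale that minimizes
    # the squared error between delta and scale * sign(delta).
    scale = float(np.mean(np.abs(delta)))
    # Pack the sign of every entry into bits (8 weights per byte).
    sign_bits = np.packbits(delta.ravel() >= 0)
    return sign_bits, scale


def reconstruct(w_base, sign_bits, scale):
    """Rebuild approximate fine-tuned weights from the base and a 1-bit delta."""
    # Unpack to 0/1, then map to -1/+1.
    bits = np.unpackbits(sign_bits, count=w_base.size)
    signs = bits.astype(w_base.dtype) * 2 - 1
    return w_base + scale * signs.reshape(w_base.shape)
```

In a multi-tenant setting, `w_base` would be stored once and each fine-tuned model would contribute only its own `sign_bits` and `scale`. For a 16-bit weight format, the packed signs take one sixteenth of the bytes of the full-precision delta.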