The advent of 1-bit large language models (LLMs), led by BitNet b1.58, has spurred interest in ternary LLMs. Nevertheless, research and practical applications targeting efficient edge inference for ternary LLMs remain scarce. To bridge this gap, we introduce Bitnet.cpp, an inference system optimized for BitNet b1.58 and ternary LLMs. Since mixed-precision matrix multiplication (mpGEMM) accounts for the bulk of inference time in ternary LLMs, Bitnet.cpp incorporates a novel mpGEMM library that enables efficient, lossless inference at under 2 bits per weight. The library features two core solutions: Ternary Lookup Table (TL), which addresses the spatial inefficiencies of previous bit-wise methods, and Int2 with a Scale (I2_S), which ensures lossless edge inference; both deliver high-speed inference. Our experiments show that Bitnet.cpp achieves up to a 6.25x speedup over full-precision baselines and up to 2.32x over low-bit baselines, setting new benchmarks in the field. Additionally, we extend TL to an element-wise lookup table (ELUT) for low-bit LLMs in the appendix, presenting both theoretical and empirical evidence of its considerable potential. Bitnet.cpp is publicly available at https://github.com/microsoft/BitNet/tree/paper, offering a practical solution for the efficient deployment of edge LLMs.
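To make the lookup-table idea concrete, the following is a minimal sketch (not the Bitnet.cpp implementation; names and the group size are illustrative assumptions) of lookup-table-based mpGEMM for ternary weights: for each group of G ternary weights, every possible signed sum of the corresponding activations is precomputed once, so each output element costs one table lookup per group instead of G multiply-adds.

```python
# Illustrative sketch of LUT-based ternary mpGEMM; not the Bitnet.cpp code.
import itertools
import numpy as np

G = 4  # weights per lookup group; 3**G = 81 table entries (chosen for illustration)

def build_table(x_group):
    """Precompute sum(p * x_group) for every ternary pattern p in {-1,0,1}^G."""
    patterns = np.array(list(itertools.product((-1, 0, 1), repeat=G)))  # shape (81, G)
    return patterns @ x_group  # shape (81,)

def pattern_index(w_group):
    """Map a ternary weight group to its table row via base-3 encoding of (w + 1)."""
    idx = 0
    for w in w_group:
        idx = idx * 3 + (w + 1)
    return idx

def lut_matvec(W, x):
    """Compute y = W @ x for ternary W using one table lookup per G-sized group.

    The tables depend only on the activations x, so they are built once and
    shared across all output rows."""
    n, k = W.shape
    assert k % G == 0
    tables = [build_table(x[g:g + G]) for g in range(0, k, G)]
    y = np.zeros(n)
    for i in range(n):
        for t, g in enumerate(range(0, k, G)):
            y[i] += tables[t][pattern_index(W[i, g:g + G])]
    return y

rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))   # ternary weights in {-1, 0, 1}
x = rng.standard_normal(16)
assert np.allclose(lut_matvec(W, x), W @ x)
```

The table-build cost is amortized across all output rows, which is why lookup tables pay off when the weight matrix is tall; the actual TL kernel additionally packs the base-3 indices into a compact sub-2-bit-per-weight storage format, which this sketch omits.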