BitMoD: Bit-serial Mixture-of-Datatype LLM Acceleration

Large language models (LLMs) have demonstrated remarkable performance across various machine learning tasks. Yet the substantial memory footprint of LLMs significantly hinders their deployment. In this paper, we improve the accessibility of LLMs through BitMoD, an algorithm-hardware co-design solution that enables efficient LLM acceleration at low weight precision. On the algorithm side, BitMoD introduces fine-grained data type adaptation, which selects a separate numerical data type to quantize each group of (e.g., 128) weights. Through the careful design of these new data types, BitMoD is able to quantize LLM weights to very low precision (e.g., 4 bits and 3 bits) while maintaining high accuracy. On the hardware side, BitMoD employs a bit-serial processing element to easily support multiple numerical precisions and data types. Our hardware design includes two key innovations. First, it employs a unified representation to process different weight data types, thus reducing the hardware cost. Second, it adopts a bit-serial dequantization unit to rescale the per-group partial sum with minimal hardware overhead. Our evaluation on six representative LLMs demonstrates that BitMoD significantly outperforms state-of-the-art LLM quantization and acceleration methods. For discriminative tasks, BitMoD can quantize LLM weights to 4 bits with $<\!0.5\%$ accuracy loss on average. For generative tasks, BitMoD is able to quantize LLM weights to 3 bits while achieving better perplexity than prior LLM quantization schemes. Combining the superior model performance with an efficient accelerator design, BitMoD achieves an average of $1.69\times$ and $1.48\times$ speedups compared to prior LLM accelerators ANT and OliVe, respectively.
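To make the idea of per-group data type adaptation concrete, the following is a minimal sketch and not the BitMoD implementation. It assumes each group of weights is quantized against every candidate grid and the grid with the lowest squared error is kept. The two 3-bit grids (`int3_like`, `fp3_like`), the max-magnitude scaling rule, and the function names are hypothetical choices made for illustration; the abstract does not specify the actual data types or the selection criterion.

```python
# Hypothetical 3-bit candidate grids (at most 8 levels each).
# These are illustrative only and are NOT the data types defined by BitMoD.
CANDIDATE_GRIDS = {
    "int3_like": [-4.0, -3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0],
    "fp3_like": [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0],
}


def quantize_group(weights, grid):
    """Quantize one group against one grid.

    Assumption: the scale maps the largest weight magnitude onto the
    largest grid magnitude. Returns (dequantized, scale, squared_error).
    """
    max_abs = max(abs(w) for w in weights)
    grid_max = max(abs(g) for g in grid)
    scale = max_abs / grid_max if max_abs > 0 else 1.0
    # Round each scaled weight to the nearest grid level, then rescale.
    dequantized = [
        scale * min(grid, key=lambda g: abs(w / scale - g)) for w in weights
    ]
    error = sum((w - d) ** 2 for w, d in zip(weights, dequantized))
    return dequantized, scale, error


def quantize_tensor(weights, group_size=128):
    """Pick, for every group, the candidate grid with the lowest error.

    Returns one (grid_name, scale, dequantized, error) tuple per group.
    """
    results = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        best = None
        for name, grid in CANDIDATE_GRIDS.items():
            dequantized, scale, error = quantize_group(group, grid)
            if best is None or error < best[3]:
                best = (name, scale, dequantized, error)
        results.append(best)
    return results


if __name__ == "__main__":
    # Two toy groups of 8 weights: the first matches the symmetric grid,
    # the second matches the uniform grid.
    toy = [-4, -2, -1, 0, 1, 2, 4, 0, -4, -3, -2, -1, 0, 1, 2, 3]
    for name, scale, _, error in quantize_tensor(toy, group_size=8):
        print(name, scale, error)
```

In this sketch the chosen grid name would have to be stored as per-group metadata alongside the scale, so that a decoder knows which levels the 3-bit codes refer to.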
