For reasons such as privacy, there are use cases for language models at the edge. This has given rise to small language models targeted for deployment on resource-constrained devices where energy efficiency is critical. Spiking neural networks (SNNs) offer a promising solution due to their energy efficiency, and several works have already explored realizing transformer-based models on SNNs. However, key operations like softmax and layer normalization (LN) are difficult to implement on neuromorphic hardware, and many of these early works sidestepped them. To address these challenges, we introduce Sorbet, a transformer-based spiking language model that is more compatible with neuromorphic hardware. Sorbet incorporates a novel shifting-based softmax called PTsoftmax and a Bit Shifting PowerNorm (BSPN), each designed to replace its energy-intensive counterpart. By leveraging knowledge distillation and model quantization, Sorbet yields a highly compressed binary-weight model that maintains competitive performance while achieving $27.16\times$ energy savings compared to BERT. We validate Sorbet through extensive testing on the GLUE benchmark and a series of ablation studies, demonstrating its potential as an energy-efficient solution for language model inference. Our code is publicly available at \href{https://github.com/Kaiwen-Tang/Sorbet}{https://github.com/Kaiwen-Tang/Sorbet}.
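For intuition, a common way to make softmax shift-friendly is to replace the exponential with a power of two so that, for integer-quantized inputs, both the numerator and the normalization reduce to bit shifts and additions. A minimal sketch of this idea follows; it is an illustrative assumption and not necessarily the exact PTsoftmax formulation used by Sorbet:
\[
\mathrm{softmax}_{2}(x)_i \;=\; \frac{2^{\,x_i - \max_j x_j}}{\sum_{k} 2^{\,x_k - \max_j x_j}},
\]
where subtracting the maximum keeps the exponents non-positive, so each term $2^{\,x_i - \max_j x_j}$ can be produced by a right shift and the final division can likewise be approximated by shifting when the denominator is rounded to a power of two.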