Artificial neural networks (ANNs), particularly large language models (LLMs), demonstrate powerful inference capabilities but consume substantial energy. Spiking neural networks (SNNs), by contrast, are highly energy-efficient owing to their binary, event-driven computation, which motivates the study of ANN-to-SNN conversion. Quantization plays a pivotal role in this process: it maps an LLM's floating-point parameters to discrete SNN parameters that are expressed over the temporal dimension of a time window. However, two challenges remain: (i) converting high bit-width quantized values into binary spikes requires longer time windows, which increases system latency; and (ii) SNNs face an inherent trade-off between the information loss of single-spike schemes and the energy cost of multi-spike schemes. To address these challenges, we propose Kirin, an integer-spike hybrid SNN that achieves accuracy-lossless ANN-to-SNN conversion with both time and energy efficiency. Specifically, we first propose a Spike Matrix Hybridization strategy that encodes low bit-width parameters, which require only small time windows, into binary spikes while keeping the remaining parameters in integer format, thereby reducing the overall latency of SNN execution. Second, we introduce a silence threshold mechanism that regulates the timing of single-spike firing, ensuring that the output is mathematically equivalent to the LLM's output and thus preserves accuracy. Experimental results demonstrate that Kirin, under a W4A4\&8 quantization setting, achieves near-FP16 accuracy while reducing energy consumption by up to 84.66\% and shortening time steps by 93.75\%.
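To make the link between bit-width and time-window length concrete, the following is a minimal sketch and not Kirin's actual implementation. It assumes a simple count-based (rate) code in which an unsigned integer v is emitted as v binary spikes, so a b-bit value needs a window of 2^b - 1 steps. The abstract does not specify the encoding, and all function names here (`window_size`, `encode_spikes`, `hybridize`, `reconstruct`) are hypothetical. Values that fit in the low bit-width are turned into spike trains, the rest stay as integers, and the original values can be recovered exactly.

```python
def window_size(bits):
    """Time steps needed to count-code an unsigned `bits`-bit integer (assumed coding)."""
    return 2 ** bits - 1


def encode_spikes(value, bits):
    """Encode `value` as a binary spike train: `value` ones, then zeros up to the window."""
    steps = window_size(bits)
    if not 0 <= value <= steps:
        raise ValueError(f"{value} does not fit in {bits} unsigned bits")
    return [1] * value + [0] * (steps - value)


def hybridize(values, low_bits=4):
    """Split values into spike-coded entries (fit in `low_bits`) and integer entries."""
    limit = window_size(low_bits)
    return [
        ("spike", encode_spikes(v, low_bits)) if v <= limit else ("int", v)
        for v in values
    ]


def reconstruct(hybrid):
    """Recover the original integers: sum the spikes, pass integers through."""
    return [sum(payload) if kind == "spike" else payload for kind, payload in hybrid]


# Hypothetical activations mixing 4-bit-range and 8-bit-range values.
acts = [3, 200, 15, 0, 17]
hybrid = hybridize(acts)
restored = reconstruct(hybrid)
```

Under this assumed coding, spike-coding every value at 8 bits would need a 255-step window, whereas the hybrid representation only ever needs the 15-step window of the 4-bit entries, which is the intuition behind keeping high bit-width values in integer form.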