Spike-Aware C++ INT8 Inference for Sparse Spiking Language Models on Commodity CPUs

Spiking language models expose activation sparsity that dense Transformer runtimes do not directly exploit. This paper studies that property from a systems perspective. Building on the SymbolicLight V1 spike-gated language model family, we implement a C++ CPU inference runtime that treats sparse binary spike states as an execution primitive rather than relying only on post-hoc weight compression. The runtime combines a manifest-driven weight loader, a mixed row/column memory layout, AVX2/FMA kernels, per-channel symmetric INT8 quantization, and integer-domain accumulation for spike-conditioned sparse paths. On an AMD Ryzen 7 5800X, an early scalar FP32 baseline decodes at 9.5 tokens/s. Mixed-layout AVX2 FP32 raises this to 14.7 tokens/s, and AVX2 INT8 reaches 19.9 tokens/s on the same step-30k export while reducing the weight footprint from 3.49 GB to 1.06 GB. For the available 186k-step, 874M-parameter INT8 export, the C++ runtime decodes at 22.63 tokens/s in a single-thread CPU benchmark, compared with 16.31 tokens/s for TinyLlama-1.1B Q8_0, 11.26 tokens/s for Falcon3-1B Q8_0, and 9.70 tokens/s for Qwen2.5-1.5B Q8_0 under llama.cpp. Decode throughput scales to 47.90 tokens/s at four CPU threads, and 512-token prefill throughput rises from 29.86 tokens/s at one thread to 94.68 tokens/s at eight threads. The throughput result comes with a quality cost: the spiking model's WikiText-2 perplexity is 24.80, worse than that of the dense baselines in the same benchmark. We frame the result as an inference-systems study for sparse language runtimes, with longer-term motivation in embodied and edge agents that may benefit from local, low-core inference near sensors and actuators. Spike-aware execution can improve CPU throughput and memory behavior for sparse spiking language models, but model quality, controlled dense training baselines, embodied-task evaluation, and measured CPU energy remain open problems.
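
To make the phrase "per-channel symmetric INT8 quantization with integer-domain accumulation for spike-conditioned sparse paths" concrete, the following is a minimal scalar sketch, not the runtime's actual code. It assumes one scale per output channel, a column-major weight layout so that each spike source owns a contiguous column, and binary spikes given as a list of active input indices. The names QuantizedMatrix, quantize_per_channel, and spike_matvec are hypothetical, and a plain loop stands in for the AVX2 kernels.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container (an assumption for illustration): INT8 weights stored
// column-major, so the weights fed by one input neuron are contiguous.
struct QuantizedMatrix {
    std::size_t rows = 0;           // output channels
    std::size_t cols = 0;           // input channels (spike sources)
    std::vector<std::int8_t> data;  // data[c * rows + r]
    std::vector<float> scale;       // one symmetric scale per output channel
};

// Per-channel symmetric quantization: scale[r] = max|W[r, :]| / 127,
// q[r, c] = round(W[r, c] / scale[r]) clamped to [-127, 127].
QuantizedMatrix quantize_per_channel(const std::vector<float>& w_row_major,
                                     std::size_t rows, std::size_t cols) {
    QuantizedMatrix q;
    q.rows = rows;
    q.cols = cols;
    q.data.assign(rows * cols, 0);
    q.scale.assign(rows, 1.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        float max_abs = 0.0f;
        for (std::size_t c = 0; c < cols; ++c) {
            max_abs = std::max(max_abs, std::fabs(w_row_major[r * cols + c]));
        }
        // An all-zero row keeps scale 1 to avoid dividing by zero.
        const float s = max_abs > 0.0f ? max_abs / 127.0f : 1.0f;
        q.scale[r] = s;
        for (std::size_t c = 0; c < cols; ++c) {
            long v = std::lround(w_row_major[r * cols + c] / s);
            v = std::max(-127L, std::min(127L, v));
            q.data[c * rows + r] = static_cast<std::int8_t>(v);
        }
    }
    return q;
}

// Spike-conditioned matrix-vector product. Because spikes are binary, each
// active input contributes its weight column unchanged: the inner loop is
// integer addition only, and inactive columns are never read.
std::vector<float> spike_matvec(const QuantizedMatrix& q,
                                const std::vector<std::uint32_t>& active) {
    std::vector<std::int32_t> acc(q.rows, 0);
    for (std::uint32_t c : active) {
        if (c >= q.cols) continue;  // ignore out-of-range indices
        const std::int8_t* col = q.data.data() + static_cast<std::size_t>(c) * q.rows;
        for (std::size_t r = 0; r < q.rows; ++r) {
            acc[r] += col[r];
        }
    }
    // Dequantize once per output channel, after all integer accumulation.
    std::vector<float> out(q.rows);
    for (std::size_t r = 0; r < q.rows; ++r) {
        out[r] = static_cast<float>(acc[r]) * q.scale[r];
    }
    return out;
}
```

In this sketch the work per layer is proportional to the number of active spikes rather than to the full input width, and the floating-point scale is applied only once per output channel. Whether the actual runtime uses the same layout and scaling granularity on every path is not stated here.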
