Spiking Neural Networks (SNNs) offer better energy efficiency than Artificial Neural Networks (ANNs). However, they show significant deficiencies in training and inference metrics in the Spiking Vision Transformer (S-ViT) setting. Existing paradigms, including ANN-SNN Conversion and Spatial-Temporal Backpropagation (STBP), suffer from inherent limitations that preclude the concurrent optimization of memory, accuracy, and energy consumption. To address these issues, we propose Ge$^\text{2}$mS-T, a novel architecture that implements grouped computation across the temporal, spatial, and network-structure dimensions. Specifically, we introduce the Grouped-Exponential-Coding-based IF (ExpG-IF) model, which enables lossless conversion with constant training overhead and precise regulation of spike patterns. Additionally, we develop Group-wise Spiking Self-Attention (GW-SSA), which reduces computational complexity via multi-scale token grouping and multiplication-free operations within a hybrid attention-convolution framework. Experiments confirm that our method achieves superior performance with ultra-high energy efficiency on challenging benchmarks. To the best of our knowledge, this is the first work to systematically establish multi-dimensional grouped computation for resolving the triad of memory overhead, learning capability, and energy budget in S-ViTs.
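As background for the conversion paradigm mentioned above, ANN-SNN Conversion typically relies on the firing rate of an integrate-and-fire (IF) neuron approximating an ANN activation. The following is a minimal sketch of a plain IF neuron with soft reset; it is not the ExpG-IF model, whose details are not given in this abstract. The function name `if_neuron_spike_count`, the constant-input assumption, and the unit threshold are illustrative choices, not taken from the paper.

```python
def if_neuron_spike_count(x: float, steps: int, threshold: float = 1.0) -> int:
    """Count spikes of a plain integrate-and-fire neuron (hypothetical helper).

    Assumes a constant input current `x` at every time step and a
    soft reset (the threshold is subtracted after each spike).
    """
    membrane = 0.0
    spikes = 0
    for _ in range(steps):
        membrane += x                  # integrate the input current
        if membrane >= threshold:      # fire once the threshold is reached
            spikes += 1
            membrane -= threshold      # soft reset keeps the residual charge
    return spikes


if __name__ == "__main__":
    steps = 8
    for x in (0.0, 0.25, 0.5):
        count = if_neuron_spike_count(x, steps)
        # The firing rate count / steps approximates the input x.
        print(f"input={x:.2f} spikes={count} rate={count / steps:.3f}")
```

With a soft reset, the spike count over `steps` time steps tracks `x * steps` for inputs in [0, 1], which is why more time steps give a finer approximation in plain rate coding, at the cost of latency and energy.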