Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.
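The patch adjacency masking and signed value pathway described above can be illustrated with a minimal NumPy sketch. This is not the BSViT implementation: the function names `patch_adjacency_mask` and `masked_spike_attention`, the `radius` parameter, the Chebyshev-distance neighborhood, and the maximum burst level of 3 are all assumptions made for illustration. The abstract does not specify these details.

```python
import numpy as np

def patch_adjacency_mask(grid_h, grid_w, radius=1):
    """Binary (N, N) mask, N = grid_h * grid_w.

    Entry (i, j) is 1 when patches i and j lie within `radius` grid steps
    of each other (Chebyshev distance). The neighborhood shape is an
    assumption, not taken from the paper.
    """
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    return (np.maximum(d_row, d_col) <= radius).astype(np.int64)

def masked_spike_attention(q, k, v_exc, v_inh, mask):
    """Hypothetical dual-channel attention on spike tensors.

    q     : (N, D) binary spikes {0, 1}
    k     : (N, D) burst spikes, small non-negative integers
    v_exc : (N, D) binary excitatory value spikes
    v_inh : (N, D) binary inhibitory value spikes
    mask  : (N, N) binary adjacency mask
    """
    # With binary q, each score is a sum of selected key entries,
    # so no true multiplication is needed on suitable hardware.
    scores = (q @ k.T) * mask
    # Signed output: excitatory accumulation minus inhibitory accumulation.
    return scores @ v_exc - scores @ v_inh

# Toy inputs on a 3x3 patch grid with 4 channels (values are random).
rng = np.random.default_rng(0)
n_patches, dim = 9, 4
q = rng.integers(0, 2, size=(n_patches, dim))      # binary queries
k = rng.integers(0, 4, size=(n_patches, dim))      # burst keys, levels 0..3
v_exc = rng.integers(0, 2, size=(n_patches, dim))  # excitatory channel
v_inh = rng.integers(0, 2, size=(n_patches, dim))  # inhibitory channel

mask = patch_adjacency_mask(3, 3, radius=1)
out = masked_spike_attention(q, k, v_exc, v_inh, mask)
```

NumPy evaluates the matrix products with ordinary multiplications. The sketch only shows that, with binary queries and binary value channels, every product reduces to selecting and accumulating integers, which is the property the abstract calls addition-only computation.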