VL2Spike: Spike-driven Distillation from VLMs for Low-Power Visual Perception in Embodied AI

Spiking neural networks (SNNs) are brain-inspired, event-driven models that compute with sparse spikes, which enables highly efficient visual perception on resource-constrained embodied AI platforms. The emergence of Spiking Transformer (Spikformer) models with spike self-attention has substantially improved the learning capacity of pure SNNs. Although SNNs are energy efficient, their performance is still limited by the spike-based architecture and by optimization challenges, since standard gradient descent cannot be applied directly to non-differentiable spike generation. Meanwhile, vision-language models (VLMs) have recently shown rich multi-modal knowledge representations for visual perception, which makes them promising teachers for Spikformer training. To this end, we present VL2Spike, a novel spike-based knowledge distillation (KD) framework that transfers multi-modal knowledge from VLMs to compact Spikformer models. This design enhances the learning capacity of Spikformer models while preserving their energy efficiency, thereby offering a practical pathway toward low-power robotic perception. VL2Spike makes two key technical contributions. First, to align with spiking dynamics, we propose spatial-temporal visual spike (SVS) distillation, which achieves (1) shared manifold alignment between VLM image features and spike tokens, and (2) warm-started temporal consistency on membrane potentials and spike rates. Second, we design a novel spike prototype-guided linguistic (SPL) distillation strategy that aligns Spikformer's class prototypes and logits with promptable VLM text embeddings. Extensive experiments show that VL2Spike achieves a 6.81% gain across three static datasets with only 15.7% energy consumption. It also generalizes well to robotic visual place recognition (VPR), with a gain of 6.63%, highlighting its potential for low-power perception in embodied AI.
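
To make the two distillation signals concrete, the following is a minimal, illustrative sketch and not the VL2Spike implementation. It assumes the student's spike tokens are summarized by their firing rate over time and already live in the same dimension as the teacher's image feature (the learned projection to a shared manifold is omitted). It also assumes the teacher's class scores are cosine similarities between its image feature and one text embedding per class. All function names and the temperature parameter are hypothetical.

```python
import math

def firing_rate(spikes):
    """Average binary spike trains over time: spikes[t][d] -> rate[d]."""
    steps = len(spikes)
    dim = len(spikes[0])
    return [sum(step[d] for step in spikes) / steps for d in range(dim)]

def cosine(u, v):
    """Cosine similarity with a small epsilon to avoid division by zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def visual_alignment_loss(spikes, teacher_image_feat):
    """Assumed visual term: 1 - cos(student firing rates, teacher image feature)."""
    return 1.0 - cosine(firing_rate(spikes), teacher_image_feat)

def linguistic_kd_loss(student_logits, teacher_image_feat, text_embeds,
                       temperature=1.0):
    """Assumed linguistic term: KL(teacher || student) over class distributions.

    Teacher logits are image-text cosine similarities, one per class prompt.
    """
    teacher_logits = [cosine(teacher_image_feat, t) for t in text_embeds]
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Toy usage: 4 time steps, 3 feature channels, 2 classes.
spikes = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [1, 0, 1]]
teacher_feat = [1.0, 0.25, 0.75]
text_embeds = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
loss = (visual_alignment_loss(spikes, teacher_feat)
        + linguistic_kd_loss([2.0, 0.5], teacher_feat, text_embeds))
```

In this toy setup the visual term is close to zero when the firing-rate vector points in the same direction as the teacher feature, and the linguistic term is close to zero when the student's class distribution matches the teacher's image-text similarity distribution. The temporal consistency on membrane potentials mentioned in the abstract is not modeled here.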
