Spiking neural networks (SNNs) have been shown to achieve performance comparable to deep neural networks (DNNs) in both visual and linguistic domains, while offering improved energy efficiency and greater biological plausibility. However, extending such single-modality SNNs to multimodal scenarios remains unexplored. Drawing inspiration from contrastive language-image pre-training (CLIP), we introduce a novel framework, named SpikeCLIP, to bridge the gap between the two modalities within spike-based computing through a two-step recipe of ``Alignment Pre-training + Dual-Loss Fine-tuning''. Extensive experiments demonstrate that SNNs achieve results comparable to their DNN counterparts while significantly reducing energy consumption across a variety of datasets commonly used for multimodal model evaluation. Furthermore, SpikeCLIP maintains robust performance in image classification tasks whose class labels are not restricted to a predefined set of categories.