Spiking neural networks (SNNs) have achieved performance comparable to that of deep neural networks (DNNs) in both visual and linguistic domains, while offering improved energy efficiency and greater biological plausibility. However, extending such single-modality SNNs to multimodal scenarios remains unexplored. Drawing inspiration from contrastive language-image pre-training (CLIP), we introduce a novel framework, SpikeCLIP, that bridges the gap between the two modalities in spike-based computing through a two-step recipe: ``Alignment Pre-training + Dual-Loss Fine-tuning''. Extensive experiments demonstrate that SNNs achieve results comparable to their DNN counterparts while significantly reducing energy consumption across a variety of datasets commonly used for multimodal model evaluation. Furthermore, SpikeCLIP maintains robust performance in image classification tasks whose class labels are not restricted to a predefined set of categories.