PTS-SNN: Prompt-Tuned Temporal Shift Spiking Neural Networks for Efficient Speech Emotion Recognition

Speech Emotion Recognition (SER) is widely deployed in Human-Computer Interaction, yet the high computational cost of conventional models hinders their deployment on resource-constrained edge devices. Spiking Neural Networks (SNNs) offer an energy-efficient alternative due to their event-driven nature; however, their integration with continuous Self-Supervised Learning (SSL) representations is fundamentally challenged by distribution mismatch, where high-dynamic-range embeddings degrade the information coding capacity of threshold-based neurons. To resolve this, we propose Prompt-Tuned Spiking Neural Networks (PTS-SNN), a parameter-efficient neuromorphic adaptation framework that aligns frozen SSL backbones with spiking dynamics. Specifically, we introduce a Temporal Shift Spiking Encoder that captures local temporal dependencies via parameter-free channel shifts, establishing a stable feature basis. To bridge the domain gap, we devise a Context-Aware Membrane Potential Calibration strategy. This mechanism uses a Spiking Sparse Linear Attention module to aggregate global semantic context into learnable soft prompts, which dynamically regulate the bias voltages of Parametric Leaky Integrate-and-Fire (PLIF) neurons. This regulation centers the heterogeneous input distribution within the responsive firing range, so that neurons neither fall silent (inputs persistently below threshold) nor saturate (inputs persistently above it). Extensive experiments on five multilingual datasets (including IEMOCAP, CASIA, and EMODB) demonstrate that PTS-SNN achieves 73.34% accuracy on IEMOCAP, comparable to competitive Artificial Neural Networks (ANNs), while requiring only 1.19M trainable parameters and 0.35 mJ of inference energy per sample.
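
To make the two mechanisms named above concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a TSM-style shift in which a fixed fraction of channels is moved one step along the time axis with zero padding, and a PLIF neuron with a hard reset to zero whose input receives an additive per-channel bias. The names `temporal_shift`, `plif_forward`, `shift_frac`, `w`, and `v_th` are hypothetical, and the bias is a plain vector here, whereas the abstract describes it as being derived from soft prompts through spiking attention.

```python
import numpy as np


def temporal_shift(x, shift_frac=0.25):
    """Parameter-free channel shift along time (assumed TSM-style).

    x: array of shape (T, C). A fraction `shift_frac` of channels is
    delayed by one step, an equal fraction is advanced by one step,
    and the rest are left unchanged. Boundaries are zero-padded.
    """
    T, C = x.shape
    k = int(C * shift_frac)
    out = np.zeros_like(x)
    out[1:, :k] = x[:-1, :k]            # first k channels see the previous frame
    out[:-1, k:2 * k] = x[1:, k:2 * k]  # next k channels see the following frame
    out[:, 2 * k:] = x[:, 2 * k:]       # remaining channels are untouched
    return out


def plif_forward(x, bias, w=0.0, v_th=1.0):
    """PLIF neuron layer with an additive bias on the input current.

    x: (T, C) input currents, bias: (C,) per-channel offset.
    The leak factor is sigmoid(w); in training w would be learnable,
    here it is a fixed scalar. Hard reset to 0 after a spike (assumption).
    """
    decay = 1.0 / (1.0 + np.exp(-w))
    v = np.zeros(x.shape[1])
    spikes = np.zeros_like(x, dtype=float)
    for t in range(x.shape[0]):
        v = v + decay * (x[t] + bias - v)  # leaky integration toward input + bias
        fired = v >= v_th
        spikes[t] = fired
        v = np.where(fired, 0.0, v)        # reset neurons that fired
    return spikes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(20, 8))       # stand-in for frozen SSL features
    shifted = temporal_shift(feats)
    rate_no_bias = plif_forward(shifted, np.zeros(8)).mean()
    rate_biased = plif_forward(shifted, np.full(8, 1.0)).mean()
    print("firing rate without bias:", rate_no_bias)
    print("firing rate with bias 1.0:", rate_biased)
```

In this toy setting, a weak constant input never reaches threshold and the layer stays silent, while a suitable bias moves the same input into the firing range; an overly large bias would instead make every neuron fire at every step, which is the saturation case the calibration is meant to avoid.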
