In this work, we introduce S4M, a new efficient speech separation framework based on neural state-space models (SSMs). Motivated by linear time-invariant systems for sequence modeling, our SSM-based approach efficiently represents input signals in the form of linear ordinary differential equations (ODEs) for representation learning. To extend the SSM technique to speech separation, we first decompose the input mixture into multi-scale representations at different resolutions. This mechanism enables S4M to learn globally coherent separation and reconstruction. Experimental results show that S4M performs comparably to other separation backbones in terms of SI-SDRi while having much lower model complexity and significantly fewer trainable parameters. In addition, our S4M-tiny model (1.8M parameters) even surpasses the attention-based Sepformer (26.0M parameters) in noisy conditions with only 9.2% of the multiply-accumulate operations (MACs).
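To make the linear-ODE view concrete, the following is a minimal sketch of a generic linear time-invariant SSM, x'(t) = A x(t) + B u(t), y(t) = C x(t), discretized with the bilinear transform and run as a recurrence over a sampled signal. It is not the S4M implementation: the diagonal state matrix, the function names `discretize_bilinear` and `ssm_scan`, and all numeric values are assumptions chosen only for illustration.

```python
def discretize_bilinear(a, b, dt):
    """Bilinear discretization of a diagonal SSM x'(t) = a*x(t) + b*u(t).

    a: diagonal entries of the state matrix A (assumed diagonal for simplicity)
    b: entries of the input vector B
    dt: sampling step
    Returns the discrete-time parameters (a_bar, b_bar) as lists.
    """
    a_bar = [(1 + dt / 2 * ai) / (1 - dt / 2 * ai) for ai in a]
    b_bar = [dt * bi / (1 - dt / 2 * ai) for ai, bi in zip(a, b)]
    return a_bar, b_bar


def ssm_scan(a_bar, b_bar, c, u):
    """Run x[k] = a_bar*x[k-1] + b_bar*u[k], y[k] = sum(c*x[k]) over a 1-D signal u."""
    x = [0.0] * len(a_bar)  # hidden state starts at zero
    y = []
    for u_k in u:
        # Elementwise state update (valid because A is diagonal here)
        x = [ab * xi + bb * u_k for ab, bb, xi in zip(a_bar, b_bar, x)]
        # Linear readout of the hidden state
        y.append(sum(ci * xi for ci, xi in zip(c, x)))
    return y


# Hypothetical toy system: one stable state driven by a unit step input
a_bar, b_bar = discretize_bilinear([-1.0], [1.0], dt=0.1)
y = ssm_scan(a_bar, b_bar, c=[1.0], u=[1.0] * 500)
```

For this toy system the output settles toward the continuous-time steady state -b/a = 1, which the bilinear transform preserves. Because the recurrence is linear and time-invariant, it can equivalently be computed as a convolution over the whole sequence, which is one reason SSM layers can be efficient on long inputs.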