We propose a multi-dimensional structured state space (S4) approach to speech enhancement. To better capture spectral dependencies along the frequency axis, we modify the multi-dimensional S4 layer with a whitening transformation and use it to build small-footprint models that still achieve good performance. We explore several S4-based deep architectures in the time (T) and time-frequency (TF) domains. The 2-D S4 layer can be viewed as a particular convolutional layer with an infinite receptive field, yet it uses fewer parameters than a conventional convolutional layer. On the VoiceBank-DEMAND dataset, the proposed TF-domain S4-based model is 78.6% smaller than a conventional U-Net model based on convolutional layers, yet it still achieves competitive results, with a PESQ score of 3.15 when trained with data augmentation. By increasing the model size, we can even reach a PESQ score of 3.18.
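The claim that an S4 layer acts as a convolution with an unbounded receptive field and few parameters can be illustrated with a minimal sketch. The code below is not the proposed 2-D S4 layer and omits the whitening transformation; it assumes a simplified 1-D state space model with a real diagonal state matrix, and every name in it (`ssm_kernel`, `ssm_recurrence`, `a`, `b`, `c`) is hypothetical. It shows that running the recurrence is equivalent to a causal convolution whose kernel is as long as the input, while the parameter count depends only on the state size.

```python
import numpy as np


def ssm_kernel(a, b, c, length):
    """Convolution kernel K[j] = sum_n c[n] * a[n]**j * b[n], j = 0..length-1."""
    # powers[j, n] = a[n] ** j, shape (length, state_size)
    powers = a[None, :] ** np.arange(length)[:, None]
    return (powers * (b * c)[None, :]).sum(axis=1)


def ssm_recurrence(a, b, c, u):
    """Run x_k = a * x_{k-1} + b * u_k, y_k = c . x_k step by step."""
    x = np.zeros_like(a)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        x = a * x + b * u_k
        y[k] = np.dot(c, x)
    return y


# Assumed toy sizes: 4 state dimensions, 64 input samples.
state_size, seq_len = 4, 64
rng = np.random.default_rng(0)
a = rng.uniform(0.5, 0.99, state_size)  # diagonal state matrix, |a| < 1 for stability
b = rng.standard_normal(state_size)
c = rng.standard_normal(state_size)
u = rng.standard_normal(seq_len)

kernel = ssm_kernel(a, b, c, seq_len)
y_recurrent = ssm_recurrence(a, b, c, u)
# Causal convolution: keep the first seq_len samples of the full convolution.
y_convolved = np.convolve(u, kernel)[:seq_len]

# Parameters depend on the state size only, not on the kernel length.
num_params = a.size + b.size + c.size
```

Here the kernel spans all 64 input samples but is generated from only 12 parameters, and it could be extended to any length without adding more. A conventional convolutional layer would need one weight per kernel tap to cover the same span.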