General speech restoration requires techniques that can interpret complex speech structures under a variety of distortions. While state-space models such as SEMamba have advanced the state of the art in speech denoising, they are not inherently designed to exploit speech-specific structure such as spectral periodicity, nor to analyze frequency content at multiple resolutions. In this work, we introduce an architecture that incorporates speech-specific features as inductive biases. In particular, we propose the Global, Local, and Periodic (GLP) module, a frequency feature extraction block that exploits the global, local, and periodic properties of frequency bins both effectively and efficiently. We then design a multi-resolution parallel time-frequency dual-processing block to capture diverse spectral patterns, together with a learnable mapping that further improves model performance. Combining these components, the proposed SEMamba++ achieves the best performance when compared against multiple baseline models while remaining computationally efficient.
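To make the notion of multi-resolution frequency analysis concrete, the following is a minimal sketch that computes magnitude spectrograms of the same signal at several FFT sizes. It is not the SEMamba++ implementation: the function names `stft_magnitude` and `multi_resolution_spectra`, the FFT sizes, the hop length of one quarter of the window, and the synthetic 200 Hz harmonic test tone are all assumptions chosen for illustration.

```python
import numpy as np


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from a Hann-windowed short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    # One row per frame, one column per non-negative frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))


def multi_resolution_spectra(signal, fft_sizes=(256, 512, 1024)):
    """Analyze one signal at several resolutions (hypothetical helper).

    Short windows give finer time resolution; long windows give finer
    frequency resolution. The hop of n_fft // 4 is an assumed setting.
    """
    return {n: stft_magnitude(signal, n, n // 4) for n in fft_sizes}


# Synthetic harmonic signal (assumed test input): a 200 Hz fundamental with
# two weaker harmonics, loosely imitating the periodic spectrum of voiced speech.
sr = 16000
t = np.arange(sr) / sr
harmonic = sum(np.sin(2 * np.pi * 200 * k * t) / k for k in (1, 2, 3))

spectra = multi_resolution_spectra(harmonic)
for n_fft, mag in spectra.items():
    peak_hz = mag.mean(axis=0).argmax() * sr / n_fft
    print(f"n_fft={n_fft}: shape={mag.shape}, strongest bin near {peak_hz:.1f} Hz")
```

Each resolution places the fundamental in a different bin, and the longer windows separate the evenly spaced harmonics more cleanly. This kind of regular spacing across frequency bins is the sort of periodic structure a frequency-aware block could exploit.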