Building a single universal speech enhancement (SE) system that can handle arbitrary input is an in-demand but underexplored research topic. Toward this ultimate goal, one direction is to build a single model that handles diverse audio durations, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed and showed promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper, we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while the degradation in real conditions is substantially mitigated. To this end, we redesign the key components that comprise such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, which demonstrate superior performance with fewer parameters and lower computational costs than the existing method. With all proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance in real conditions. The recipe with full model details is released at https://github.com/espnet/espnet.