Building a single universal speech enhancement (SE) system that can handle arbitrary input is a much-needed but underexplored research topic. Towards this ultimate goal, one direction is to build a single model that handles diverse audio durations, sampling frequencies, and microphone variations in noisy and reverberant scenarios, which we define here as "input condition invariant SE". Such a model was recently proposed and showed promising performance; however, its multi-channel performance degraded severely in real conditions. In this paper, we propose novel architectures to improve the input condition invariant SE model so that performance in simulated conditions remains competitive while the degradation in real conditions is substantially mitigated. To this end, we redesign the key components of such a system. First, we identify that the channel-modeling module's generalization to unseen scenarios can be sub-optimal and redesign this module. We further introduce a two-stage training strategy to enhance training efficiency. Second, we propose two novel dual-path time-frequency blocks, which demonstrate superior performance with fewer parameters and lower computational cost than the existing method. With all proposals combined, experiments on various public datasets validate the efficacy of the proposed model, with significantly improved performance in real conditions. The recipe with full model details is released at https://github.com/espnet/espnet.