Speech signals in real-world environments are frequently affected by distortions such as additive noise, reverberation, and bandwidth limitation, which may appear individually or in combination. Traditional speech enhancement methods typically rely on either masking, which suppresses non-speech components while preserving the observable structure of the input, or mapping, which recovers clean speech through a direct transformation of the input. Each approach has strengths in specific scenarios but may be less effective outside its target conditions. We propose the Erase and Draw Network (EDNet), a versatile speech enhancement framework designed to handle a broad range of distortion types without prior assumptions about the task or the input characteristics. EDNet consists of two main components: (1) the Gating Mamba (GM) module, which adaptively combines masking and mapping through a learnable gating mechanism that selects between suppression (Erase) and reconstruction (Draw) based on local signal features, and (2) Phase Shift-Invariant Training (PSIT), a shift-tolerant supervision strategy that improves phase estimation by enabling dynamic alignment during training while remaining compatible with standard loss functions. Experimental results on denoising, dereverberation, bandwidth extension, and multi-distortion enhancement tasks show that EDNet consistently achieves strong performance across conditions, demonstrating its architectural flexibility and its adaptability to diverse task settings.
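To make the Erase/Draw idea concrete, the following is a minimal NumPy sketch of a gate that blends a masking branch with a mapping branch. It is not the GM module itself: the Mamba layers are replaced here by plain linear projections, and the function name `erase_and_draw`, the weight matrices, the soft sigmoid gate, and the feature shapes are all assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def erase_and_draw(x, w_mask, w_map, w_gate):
    """Blend a masking branch and a mapping branch with a learned gate.

    x: (frames, features) non-negative input features (hypothetical layout).
    w_mask, w_map, w_gate: (features, features) stand-in linear weights;
    the paper's GM module uses Mamba blocks instead (assumption noted above).
    """
    mask = sigmoid(x @ w_mask)      # Erase branch: bounded mask in (0, 1)
    erased = mask * x               # suppress parts of the observed input
    drawn = x @ w_map               # Draw branch: direct mapping of the input
    gate = sigmoid(x @ w_gate)      # per-element gate from local features
    # gate near 1 favors Erase (masking), gate near 0 favors Draw (mapping)
    output = gate * erased + (1.0 - gate) * drawn
    return output, gate

rng = np.random.default_rng(0)
x = np.abs(rng.standard_normal((4, 8)))
w_mask, w_map, w_gate = (0.1 * rng.standard_normal((8, 8)) for _ in range(3))
y, gate = erase_and_draw(x, w_mask, w_map, w_gate)
```

Because the gate is a soft value in (0, 1), each output element is a convex combination of the two branches; a hard selection would be the limiting case in which the gate saturates at 0 or 1.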
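One plausible reading of shift-tolerant supervision is sketched below: the loss is evaluated against several slightly shifted copies of the target and the smallest value is kept, so a small time offset in the estimate is not penalized. This is an illustrative assumption, not the PSIT procedure as defined in the paper; the function name `shift_tolerant_loss`, the `max_shift` parameter, the mean squared error default, and the circular shift via `np.roll` are all hypothetical choices.

```python
import numpy as np

def shift_tolerant_loss(estimate, target, max_shift=2, loss_fn=None):
    """Return the smallest loss over small integer shifts of the target.

    Any standard loss can be passed as loss_fn; mean squared error is used
    by default. np.roll shifts circularly, which is a simplification that
    is only exact for periodic signals (assumption for this sketch).
    """
    if loss_fn is None:
        loss_fn = lambda a, b: float(np.mean((a - b) ** 2))
    best_loss, best_shift = None, 0
    for shift in range(-max_shift, max_shift + 1):
        candidate = loss_fn(estimate, np.roll(target, shift))
        if best_loss is None or candidate < best_loss:
            best_loss, best_shift = candidate, shift
    return best_loss, best_shift

t = np.arange(64)
target = np.sin(2 * np.pi * t / 16)   # periodic toy signal
estimate = np.roll(target, 1)         # same signal, delayed by one sample
loss, shift = shift_tolerant_loss(estimate, target)
plain_mse = float(np.mean((estimate - target) ** 2))
```

In this toy case the plain mean squared error is nonzero even though the estimate is a perfect copy delayed by one sample, while the shift-tolerant version finds the aligning shift and reports zero loss.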