Existing deep-learning-based speech enhancement methods mainly take a data-driven approach, leveraging large amounts of data covering a variety of noise types to remove noise from noisy signals. However, this heavy dependence on data limits their generalization to unseen, complex noises in real-life environments. In this paper, we focus on the low-latency scenario and treat speech enhancement as a speech generation problem conditioned on the noisy signal: we generate clean speech instead of identifying and removing noise. Specifically, we propose a conditional generative framework for speech enhancement that models clean speech with the acoustic codes of a neural speech codec and generates these codes auto-regressively, conditioned on past noisy frames. Moreover, we propose an explicit-alignment approach that aligns noisy frames with the generated speech tokens to improve robustness and scalability to different input lengths. Unlike other methods that use multiple stages to generate speech codes, we use a single-stage speech generation approach based on the TF-Codec neural codec to achieve high speech quality with low latency. Extensive results on both synthetic and real-recorded test sets show that it outperforms data-driven approaches in noise robustness and temporal speech coherence.
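To make the generation loop concrete, the following is a minimal sketch of causal, frame-aligned code generation. It reflects one plausible reading of the explicit-alignment idea (each noisy frame is paired with the token generated one step earlier), not the paper's actual model. The names `generate_codes`, `toy_predictor`, `CODEBOOK_SIZE`, and `BOS_TOKEN` are hypothetical, and the predictor is a deterministic stand-in for a trained network; decoding the codes back to a waveform with the codec decoder is omitted.

```python
from typing import Callable, List, Sequence, Tuple

Frame = Sequence[float]
StepInput = Tuple[Frame, int]  # (noisy frame at step t, token generated at step t-1)

CODEBOOK_SIZE = 1024  # hypothetical codebook size, chosen for illustration only
BOS_TOKEN = 0         # hypothetical start token used before any code exists


def toy_predictor(history: Sequence[StepInput]) -> int:
    """Stand-in for a trained causal model (an assumption, not the paper's network).

    It looks only at the most recent aligned pair and maps the frame energy
    plus the previous token to a codebook index.
    """
    frame, prev_token = history[-1]
    energy = sum(x * x for x in frame)
    return (int(energy * 100) + prev_token) % CODEBOOK_SIZE


def generate_codes(
    noisy_frames: Sequence[Frame],
    predictor: Callable[[Sequence[StepInput]], int] = toy_predictor,
) -> List[int]:
    """Generate one code per noisy frame, using only past and current inputs."""
    history: List[StepInput] = []
    tokens: List[int] = []
    prev_token = BOS_TOKEN
    for frame in noisy_frames:
        # Pair the current noisy frame with the previously generated token,
        # so the model input stays aligned one-to-one with the frames.
        history.append((frame, prev_token))
        prev_token = predictor(history)
        tokens.append(prev_token)
    return tokens


frames = [[0.1, -0.2], [0.3, 0.0], [0.05, 0.4], [0.0, 0.0]]
codes = generate_codes(frames)
print(len(codes), "codes generated for", len(frames), "noisy frames")
```

Because each step depends only on frames and tokens seen so far, the codes for a prefix of the input match the prefix of the codes for the full input, which is the property a low-latency, streaming setup needs.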