Universal speech enhancement aims to handle inputs with diverse speech distortions and recording conditions. In this work, we propose a novel hybrid architecture that combines the signal fidelity of discriminative modeling with the reconstruction capabilities of generative modeling. Our system employs the discriminative TF-GridNet model with a Sampling-Frequency-Independent (SFI) strategy to handle variable sampling rates universally. In parallel, an autoregressive model combined with spectral mapping generates detail-rich speech while effectively suppressing generative artifacts. Finally, a fusion network learns adaptive weights for the two outputs, optimized with signal-level losses and a comprehensive Speech Quality Assessment (SQA) loss. The proposed system was evaluated in the ICASSP 2026 URGENT Challenge (Track 1) and ranked third.
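The adaptive fusion of the two branches can be pictured as a convex combination of the discriminative and generative outputs with predicted weights. The sketch below is a minimal illustration of that idea only; the function and variable names are assumptions, and the actual fusion network's inputs and architecture are not specified in the abstract:

```python
import numpy as np

def fuse(disc_out, gen_out, logits):
    """Blend two enhanced waveforms with per-sample adaptive weights.

    disc_out, gen_out: outputs of the discriminative and generative
    branches (hypothetical names). logits: raw scores that a fusion
    network would predict; here they are passed in directly.
    """
    # Sigmoid maps logits to weights in (0, 1); weight w goes to the
    # discriminative branch, (1 - w) to the generative branch.
    w = 1.0 / (1.0 + np.exp(-logits))
    return w * disc_out + (1.0 - w) * gen_out

# Toy example: zero logits give equal weights, i.e. a plain average.
d = np.array([1.0, 0.0, 1.0])
g = np.array([0.0, 1.0, 1.0])
out = fuse(d, g, np.zeros(3))
```

In the actual system, the weights would be learned end-to-end under the signal-level and SQA losses rather than supplied by hand.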