Pre-trained models for automatic speech recognition (ASR) and speech enhancement (SE) have exhibited remarkable capabilities under matched noise and channel conditions. However, these models often suffer severe performance degradation when confronted with domain shifts, particularly unseen noise and channel distortions. To address this, in this paper we present URSA-GAN, a unified and domain-aware generative framework specifically designed to mitigate mismatches in both noise and channel conditions. URSA-GAN leverages a dual-embedding architecture consisting of a noise encoder and a channel encoder, each pre-trained with limited in-domain data to capture domain-relevant representations. These embeddings condition a GAN-based speech generator, enabling the synthesis of speech that is acoustically aligned with the target domain while preserving phonetic content. To further enhance generalization, we propose dynamic stochastic perturbation, a novel regularization technique that introduces controlled variability into the embeddings during generation, promoting robustness to unseen domains. Empirical results demonstrate that URSA-GAN effectively reduces character error rates in ASR and improves perceptual metrics in SE across diverse noisy and mismatched channel scenarios. Notably, evaluations on compound test conditions with both channel and noise degradations confirm the generalization ability of URSA-GAN, yielding relative improvements of 16.16% in ASR performance and 15.58% in SE metrics.
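The dynamic stochastic perturbation described above can be pictured as injecting zero-mean Gaussian noise, with a dynamically controlled scale, into the domain embeddings before they condition the generator. The following is a minimal sketch under that assumption; the function name, the embedding dimensionality, and the annealing factor are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def dynamic_stochastic_perturbation(embedding, base_scale=0.1, anneal=1.0, rng=None):
    """Hypothetical sketch: perturb a domain embedding with zero-mean Gaussian
    noise whose standard deviation is modulated dynamically (here via a simple
    annealing factor). The actual URSA-GAN formulation may differ."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = base_scale * anneal          # controlled variability of the perturbation
    noise = rng.normal(0.0, sigma, size=embedding.shape)
    return embedding + noise

# Illustrative usage: perturb a (hypothetical) 256-dim noise-encoder embedding
# before feeding it to the GAN-based generator as a conditioning vector.
emb = np.zeros(256)
perturbed = dynamic_stochastic_perturbation(
    emb, base_scale=0.1, anneal=0.5, rng=np.random.default_rng(0)
)
```

In this reading, the perturbation acts as a regularizer: at each generation step the generator sees a slightly different version of the same domain embedding, which discourages it from overfitting to the exact in-domain statistics and plausibly improves robustness to unseen noise and channel conditions.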