Recent speech enhancement (SE) models increasingly leverage self-supervised learning (SSL) representations for their rich semantic information. Typically, intermediate SSL features are aggregated into a single representation by a lightweight adaptation module. However, most SSL models are not trained for noise robustness, so noisy input can corrupt their semantic representations. Moreover, the adaptation module is trained jointly with the SE model and may therefore prioritize acoustic detail over semantic information, contradicting its original purpose. To address these issues, we first analyze the behavior of SSL models on noisy speech from an information-theoretic perspective: we measure the mutual information (MI) between corrupted SSL representations and the corresponding phoneme labels, focusing on how well linguistic content is preserved. Building on this analysis, we introduce a linguistic aggregation layer, which is pre-trained to maximize MI with phoneme labels (optionally with dynamic aggregation) and then frozen during SE training. Experiments show that this decoupled approach improves word error rate (WER) over jointly optimized baselines, demonstrating the benefit of explicitly aligning the adaptation module with linguistic content.
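The aggregation-and-probe idea above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: all shapes, the softmax layer weighting, and the linear phoneme probe are assumptions. Minimizing the probe's cross-entropy is a standard proxy for maximizing a lower bound on the MI between the aggregated representation and the phoneme labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: 12 SSL layers, 50 frames, 64-dim features, 40 phonemes.
L, T, D, P = 12, 50, 64, 40
hidden_states = rng.normal(size=(L, T, D))   # stand-in for SSL layer outputs
phoneme_labels = rng.integers(0, P, size=T)  # frame-level phoneme labels

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# One learnable scalar weight per SSL layer (the aggregation parameters).
alpha = np.zeros(L)

def aggregate(states, alpha):
    """Softmax-weighted sum over SSL layers: one vector per frame."""
    w = softmax(alpha)                       # (L,)
    return np.tensordot(w, states, axes=1)   # (T, D)

# Linear phoneme probe; minimizing its cross-entropy maximizes a lower
# bound on I(aggregated representation; phoneme label).
W = rng.normal(scale=0.01, size=(D, P))

def ce_loss(feats, labels, W):
    probs = softmax(feats @ W)               # (T, P)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

agg = aggregate(hidden_states, alpha)
loss = ce_loss(agg, phoneme_labels, W)
print(agg.shape, round(loss, 3))
```

In a full setup, `alpha` and `W` would be optimized on the probe loss first, then `alpha` frozen before SE training, matching the decoupling described in the abstract.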