Advancements in monaural speech enhancement (SE) techniques have greatly improved the perceptual quality of speech. However, integrating these techniques into automatic speech recognition (ASR) systems has not yielded the expected performance gains, primarily due to the distortions introduced during the SE process. In this paper, we propose a novel approach called FAT-HuBERT, which leverages distortion-invariant self-supervised learning (SSL) to enhance the robustness of ASR. To address the distortions introduced by SE frontends, we introduce layer-wise fusion modules that incorporate features extracted from both the observed noisy signal and the enhanced signal. During training, the SE frontend is randomly selected from a pool of models. We evaluate FAT-HuBERT on simulated noisy speech generated from LibriSpeech as well as real-world noisy speech from the single-channel track of the CHiME-4 dataset. The experimental results demonstrate a significant relative reduction in word error rate (WER).
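To make the two mechanisms in the abstract concrete, here is a minimal sketch of (a) a gated layer-wise fusion of noisy and enhanced features and (b) random selection of an SE frontend from a pool during training. All names (`LayerwiseFusion`, `pick_frontend`) and the randomly initialized gate projection are hypothetical illustrations, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class LayerwiseFusion:
    """Hypothetical gated fusion of noisy and enhanced feature streams.

    A gate in (0, 1) decides, per element, how much of the enhanced
    feature to trust versus the original noisy feature, so the model
    can fall back on the noisy stream where the SE frontend distorts.
    """

    def __init__(self, dim, rng):
        # In the real model this projection would be learned; here it
        # is randomly initialized purely for illustration.
        self.w = rng.standard_normal((2 * dim, dim)) * 0.1

    def __call__(self, noisy, enhanced):
        g = sigmoid(np.concatenate([noisy, enhanced], axis=-1) @ self.w)
        return g * enhanced + (1.0 - g) * noisy

def pick_frontend(frontends, rng):
    """Draw one SE frontend at random from the pool each training step."""
    return frontends[rng.integers(len(frontends))]

# Toy usage: two stand-in "SE frontends" and one fusion layer.
frontends = [lambda x: x, lambda x: 0.9 * x]   # placeholder SE models
fusion = LayerwiseFusion(dim=16, rng=rng)
noisy = rng.standard_normal((2, 50, 16))       # (batch, frames, dim)
enhanced = pick_frontend(frontends, rng)(noisy)
fused = fusion(noisy, enhanced)
print(fused.shape)   # (2, 50, 16)
```

In the paper such a fusion is applied layer-wise inside the SSL encoder; the sketch shows only a single layer's worth of the idea.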