Automatic speech recognition (ASR) has achieved remarkable success thanks to recent advances in deep learning, but its performance usually degrades significantly under real-world noisy conditions. Recent works introduce speech enhancement (SE) as a front-end to improve speech quality, which has proved effective but may not be optimal for downstream ASR because of the speech distortion problem. Building on this, the latest works combine SE with the currently popular self-supervised learning (SSL) to alleviate distortion and improve noise robustness. Despite their effectiveness, the speech distortion caused by conventional SE still cannot be completely eliminated. In this paper, we propose a self-supervised framework named Wav2code to implement a generalized SE without distortions for noise-robust ASR. First, in the pre-training stage, the clean speech representations from the SSL model are used to look up a discrete codebook via nearest-neighbor feature matching, and the resulting code sequence is then used to reconstruct the original clean representations, so that they are stored in the codebook as a prior. Second, during finetuning, we propose a Transformer-based code predictor that accurately predicts clean codes by modeling the global dependency of the input noisy representations, which enables the discovery and restoration of high-quality clean representations without distortions. Furthermore, we propose an interactive feature fusion network that combines the original noisy representations with the restored clean representations to account for both fidelity and quality, resulting in even more informative features for downstream ASR. Finally, experiments on both synthetic and real noisy datasets demonstrate that Wav2code can resolve the speech distortion problem and improve ASR performance under various noisy conditions, resulting in stronger robustness.
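The codebook lookup used in the pre-training stage can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `nearest_code_lookup`, the use of squared Euclidean distance as the matching metric, and the toy codebook and feature values are all assumptions made for illustration. The abstract states only that lookup is done by nearest-neighbor feature matching.

```python
import numpy as np


def nearest_code_lookup(features, codebook):
    """Map each frame to its nearest codebook entry.

    features: array of shape (T, D), one representation per frame.
    codebook: array of shape (K, D), one row per discrete code.
    Returns (codes, quantized): code indices of shape (T,) and the
    matched codebook rows of shape (T, D).
    Assumption: squared Euclidean distance is the matching metric.
    """
    # Pairwise squared distances, shape (T, K), via
    # ||f - c||^2 = ||f||^2 - 2 f.c + ||c||^2
    dists = (
        np.sum(features ** 2, axis=1, keepdims=True)
        - 2.0 * features @ codebook.T
        + np.sum(codebook ** 2, axis=1)
    )
    codes = np.argmin(dists, axis=1)   # index of the nearest entry per frame
    quantized = codebook[codes]        # code sequence mapped back to vectors
    return codes, quantized


# Toy example: a 4-entry codebook in 2 dimensions (hypothetical values).
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
features = np.array([[0.9, 0.1], [0.1, 0.8], [0.95, 1.05]])

codes, quantized = nearest_code_lookup(features, codebook)
print(codes)       # [1 2 3]
print(quantized)   # the matched codebook rows
```

In the framework described above, the features would be clean SSL representations during pre-training, and the matched entries would feed a reconstruction objective so that the codebook comes to store clean representations as a prior. During finetuning, the code indices would instead be predicted from noisy input by the Transformer-based code predictor rather than obtained by direct lookup.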