Noise robustness is critical when applying automatic speech recognition (ASR) in real-world scenarios. One solution involves the use of speech enhancement (SE) models as the front end of ASR. However, neural network-based (NN-based) SE often introduces artifacts into the enhanced signals and harms ASR performance, particularly when SE and ASR are trained independently. Therefore, this study introduces a simple yet effective SE post-processing technique to bridge the gap between various pre-trained SE and ASR models. A bridge module, which is a lightweight NN, is proposed to estimate signal-level information from the speech signal. Subsequently, using this signal-level information, the observation addition technique is applied to effectively mitigate the shortcomings of SE. The experimental results demonstrate the success of our method in integrating diverse pre-trained SE and ASR models, considerably boosting ASR robustness. Crucially, no prior knowledge of the ASR model or speech content is required during the training or inference stages. Moreover, the effectiveness of this approach extends to different datasets without necessitating fine-tuning of the bridge module, ensuring efficiency and improved generalization.
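The core of the post-processing step, observation addition, can be sketched as a convex combination of the noisy observation and the SE output. In the method described above the mixing weight would be predicted by the bridge module from signal-level information; in this minimal illustrative sketch it is a fixed scalar, and the function name is our own, not from the paper:

```python
import numpy as np

def observation_addition(noisy: np.ndarray, enhanced: np.ndarray, alpha: float) -> np.ndarray:
    """Blend the raw noisy observation with the enhanced signal.

    alpha close to 1 trusts the SE output; alpha close to 0 falls back
    to the unprocessed observation, limiting the impact of SE artifacts
    on the downstream ASR model. (In the paper, alpha would be derived
    from the bridge module's signal-level estimate; a fixed scalar is
    used here purely for illustration.)
    """
    return alpha * enhanced + (1.0 - alpha) * noisy

# Toy waveforms: a clean tone plus noise, and an imperfect SE output.
rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 440 * np.arange(1600) / 16000)
noisy = clean + 0.3 * rng.standard_normal(1600)
enhanced = clean + 0.05 * rng.standard_normal(1600)  # SE output with residual artifacts
blended = observation_addition(noisy, enhanced, alpha=0.8)
```

The convex-combination form guarantees the blended signal never strays outside the span of the two inputs, so even a poorly matched SE front end degrades gracefully toward the raw observation.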