Robustness in noisy environments remains a significant challenge for automatic speech recognition (ASR). Recent ASR models, such as Whisper, have shown promise, but their efficacy in noisy conditions can be further enhanced. This study focuses on recovering from packet loss to reduce the word error rate (WER) of ASR models. We propose a front-end adaptation network connected to a frozen ASR model. The adaptation network is trained to modify the corrupted input spectrum by minimizing the ASR model's own criterion in addition to an enhancement loss function. Our experiments demonstrate that the adaptation network, trained on Whisper's criterion, notably reduces word error rates across domains and languages in packet-loss scenarios. This improvement is achieved with minimal effect on the Whisper model's foundational performance, underscoring our method's practicality and its potential for enhancing ASR models in challenging acoustic environments.
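The training setup described above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a spectrogram stored as a list of frames, a hypothetical packet-loss model in which each packet of consecutive frames is dropped independently and replaced by a fill value, mean squared error as the enhancement loss, and a hypothetical weight `enh_weight` for combining the two terms. The ASR loss itself would come from the frozen ASR model and is passed in here as a plain number; none of these names or choices are stated in the abstract.

```python
import random
from typing import List

# A spectrogram as a list of frames, each frame a list of frequency-bin values.
Spectrogram = List[List[float]]


def drop_packets(spec: Spectrogram, loss_rate: float, frames_per_packet: int,
                 rng: random.Random, fill_value: float = 0.0) -> Spectrogram:
    """Simulate packet loss (assumed model: each packet is lost independently).

    A lost packet has all of its frames replaced by `fill_value`.
    """
    corrupted = [list(frame) for frame in spec]  # copy, leave the input intact
    for start in range(0, len(spec), frames_per_packet):
        if rng.random() < loss_rate:
            end = min(start + frames_per_packet, len(spec))
            for t in range(start, end):
                corrupted[t] = [fill_value] * len(spec[t])
    return corrupted


def enhancement_loss(adapted: Spectrogram, clean: Spectrogram) -> float:
    """Mean squared error between adapted and clean spectra (an assumed choice)."""
    total = 0.0
    count = 0
    for adapted_frame, clean_frame in zip(adapted, clean):
        for a, c in zip(adapted_frame, clean_frame):
            total += (a - c) ** 2
            count += 1
    return total / count


def combined_loss(asr_loss: float, enh_loss: float, enh_weight: float = 1.0) -> float:
    """Objective for the adaptation network: ASR criterion plus weighted enhancement loss.

    `asr_loss` stands in for the frozen ASR model's loss on the adapted spectrum;
    only the adaptation network's parameters would be updated.
    """
    return asr_loss + enh_weight * enh_loss


if __name__ == "__main__":
    clean = [[1.0, 2.0] for _ in range(8)]
    corrupted = drop_packets(clean, loss_rate=0.25, frames_per_packet=2,
                             rng=random.Random(0))
    # Here the corrupted spectrum stands in for the adaptation network's output.
    print(combined_loss(asr_loss=1.0, enh_loss=enhancement_loss(corrupted, clean)))
```

In a real training loop the corrupted spectrum would first pass through the adaptation network, and the ASR term would be computed by the frozen model with gradients flowing only into the adapter.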