This paper presents an end-to-end model for improving automatic speech recognition (ASR) for a target speaker in a crowded, noisy environment. The model combines a single-channel speech enhancement module, which isolates the target speaker's voice from background noise, with an ASR module. With this approach, the word error rate (WER) of ASR drops from 80% to 26.4%. The two components are typically tuned independently because their data requirements differ. However, speech enhancement can introduce artifacts that degrade ASR accuracy. A joint fine-tuning strategy reduces the WER further, from 26.4% with separate tuning to 14.5% with joint tuning.
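For readers unfamiliar with the metric, the WER figures above can be read as word-level edit distance divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in this work; the function names `word_error_rate` and `relative_reduction` and the example sentences are hypothetical, and the scoring assumes simple whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()  # assumption: whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. 0.5 means WER was halved."""
    return (before - after) / before


if __name__ == "__main__":
    # Hypothetical example: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Relative reductions implied by the WER values reported above.
    print(round(relative_reduction(80.0, 26.4), 3))
    print(round(relative_reduction(26.4, 14.5), 3))
```

Under this definition, going from 80% to 26.4% corresponds to a relative WER reduction of 67%, and going from 26.4% to 14.5% corresponds to a further relative reduction of roughly 45%.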