End-to-end automatic speech recognition (E2E ASR) systems have significantly improved speech recognition through training on extensive datasets. Despite these advancements, they still struggle to accurately recognize domain-specific words, such as proper nouns and technical terms. To address this problem, we propose a method that uses the state-of-the-art Whisper model without modifying its architecture, preserving its generalization performance while enabling it to leverage descriptions effectively. We further propose two training techniques to improve domain-specific ASR: decoder fine-tuning and context perturbation. For cases where descriptions are unavailable, we also propose a method that uses a Large Language Model (LLM) to generate descriptions from simple metadata. Our experiments on real-life datasets demonstrate that the proposed methods notably enhance domain-specific ASR accuracy, and that LLM-generated descriptions are more effective than human-crafted ones.
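To illustrate the idea of generating descriptions from simple metadata, the following is a minimal sketch of how such metadata might be turned into an LLM prompt. The function name `build_description_prompt`, the metadata keys, and the prompt wording are all hypothetical and are not taken from our implementation; the call to the LLM itself is omitted.

```python
from typing import Mapping


def build_description_prompt(metadata: Mapping[str, str]) -> str:
    """Turn simple recording metadata into an instruction for an LLM.

    Hypothetical helper: the field names and the prompt wording are
    assumptions made for illustration only.
    """
    # Keep only non-empty fields, sorted by key so the prompt is reproducible.
    fields = [
        f"{key}: {value.strip()}"
        for key, value in sorted(metadata.items())
        if value and value.strip()
    ]
    if not fields:
        raise ValueError("metadata must contain at least one non-empty field")

    lines = [
        "Write a short description of the audio recording below.",
        "Mention proper nouns and technical terms likely to be spoken.",
        "",
        *fields,
    ]
    return "\n".join(lines)
```

The returned string would be sent to an LLM of choice, and the generated description would then be supplied to the ASR model as context in place of a human-written one.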