In recent years, end-to-end automatic speech recognition (ASR) systems have proven remarkably accurate and performant, but they still have a significant error rate on entity names that appear infrequently in their training data. In parallel with the rise of end-to-end ASR, large language models (LLMs) have proven to be versatile tools for a wide range of natural language processing (NLP) tasks. For NLP tasks where a database of relevant knowledge is available, retrieval-augmented generation (RAG) has achieved impressive results when used with LLMs. In this work, we propose a RAG-like technique for correcting entity name errors in speech recognition. Our approach uses a vector database to index a set of relevant entities. At runtime, database queries are generated from possibly erroneous text ASR hypotheses, and the entities retrieved with these queries are fed, along with the ASR hypotheses, to an LLM that has been adapted to correct ASR errors. Overall, our best system achieves relative word error rate reductions of 33% to 39% on synthetic test sets focused on voice assistant queries for rare music entities, without regressing on the STOP test set, a publicly available voice assistant test set covering many domains.
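The retrieve-then-correct flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the character-trigram `embed` function stands in for whatever embedding model a real vector database would use, the whole ASR hypothesis is used directly as the query, and `EntityIndex`, `build_prompt`, and the example entities are hypothetical names rather than components of the system reported here. The adapted LLM is not called; the sketch stops at the prompt that would be passed to it.

```python
import math
from collections import Counter


def embed(text, n=3):
    """Toy embedding (assumption): counts of character n-grams of the text."""
    padded = f" {text.lower()} "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class EntityIndex:
    """Hypothetical in-memory stand-in for a vector database of entity names."""

    def __init__(self, entities):
        self.items = [(name, embed(name)) for name in entities]

    def search(self, query, k=2):
        """Return the k entity names most similar to the query text."""
        q = embed(query)
        ranked = sorted(self.items, key=lambda item: cosine(q, item[1]), reverse=True)
        return [name for name, _ in ranked[:k]]


def build_prompt(hypothesis, candidates):
    """Combine the ASR hypothesis and retrieved entities into an LLM prompt."""
    lines = [
        "Correct entity-name errors in the ASR hypothesis.",
        f"Hypothesis: {hypothesis}",
        "Candidate entities:",
    ]
    lines += [f"- {name}" for name in candidates]
    return "\n".join(lines)


# Illustrative data only; not drawn from the test sets in this work.
index = EntityIndex(["Sigur Rós", "Tame Impala", "Khruangbin", "Bon Iver"])
hypothesis = "play tame impalla"  # possibly erroneous ASR output
candidates = index.search(hypothesis, k=2)
prompt = build_prompt(hypothesis, candidates)
print(prompt)
```

In this sketch the misspelled hypothesis still shares most of its character trigrams with the correct entity name, which is why a similarity search over the entity index can surface the intended entity even when the ASR output is wrong.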