Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition. When using appropriate modeling units, e.g., byte-pair encoding, these systems are in principle open-vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, acronyms, or domain-specific terms. To address this problem, many context biasing methods have been proposed; however, these methods may still struggle when they are unable to relate the audio to the corresponding text, e.g., in the case of a pronunciation-orthography mismatch. We propose a method in which corrections of substitution errors are used to improve the recognition accuracy of such challenging words. Users can add corrections on the fly during inference. We show that this method yields a relative improvement in biased word error rate between 22% and 34% compared to a text-based replacement method, while maintaining the overall performance.
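For intuition, the text-based replacement baseline mentioned above can be sketched as a post-processing step that swaps user-supplied corrections into the decoded hypothesis. This is a minimal illustration, not the paper's implementation; the function name and correction table are hypothetical.

```python
# Minimal sketch (assumed, not the paper's method) of a text-based
# replacement baseline: user corrections are applied to the ASR
# hypothesis as plain token substitutions after decoding.
def apply_corrections(hypothesis: str, corrections: dict) -> str:
    """Replace misrecognized tokens with user-added corrections."""
    tokens = hypothesis.split()
    fixed = [corrections.get(tok, tok) for tok in tokens]
    return " ".join(fixed)

# Example: the model misrecognizes an unseen acronym as "aren".
corrections = {"aren": "RNN-T"}  # hypothetical on-the-fly user correction
print(apply_corrections("the aren model converged", corrections))
# → the RNN-T model converged
```

Such purely textual replacement only fires when the hypothesis token matches the correction table exactly, which is why it struggles under a pronunciation-orthography mismatch.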