The end-to-end (E2E) approach is gradually replacing hybrid models for automatic speech recognition (ASR). However, E2E optimization lacks an intuitive way to handle decoding shifts, especially when the speech contains many domain-specific rare words that carry important meaning. In addition, the absence of knowledge-intensive speech datasets in academia has been a significant limiting factor, and commonly used speech corpora differ markedly from realistic conversation. To address these challenges, we present Medical Interview (MED-IT), a multi-turn consultation speech dataset that contains a substantial number of knowledge-intensive named entities. We also explore methods for improving rare-word recognition in E2E models. We propose a novel approach, post-decoder biasing, which constructs a transform probability matrix from the distribution of training transcriptions and thereby guides the model to prioritize words in the biasing list. In our experiments, on the subsets of rare words that appear in the training speech 10 to 20 times and 1 to 5 times, the proposed method achieves relative improvements of 9.3% and 5.1%, respectively.
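The abstract does not spell out how the transform probability matrix is built, so the following is only a minimal sketch of the general idea under stated assumptions, not the method used in this work. It assumes a word-level vocabulary, a row-stochastic matrix applied to the decoder's output distribution, and a hypothetical strength parameter `alpha` that moves probability mass toward biasing-list words in inverse proportion to their training frequency. The function names `build_transform_matrix` and `apply_bias` are illustrative.

```python
from collections import Counter


def build_transform_matrix(transcripts, biasing_list, alpha=0.1):
    """Build a row-stochastic matrix from training transcriptions.

    Assumption (not from the paper): every row keeps 1 - alpha of its mass
    on the diagonal and spreads alpha over the biasing words, weighted by
    inverse training frequency so that rarer words receive more mass.
    """
    counts = Counter(word for line in transcripts for word in line.split())
    vocab = sorted(counts)
    index = {word: i for i, word in enumerate(vocab)}
    bias_words = [w for w in biasing_list if w in index]
    size = len(vocab)
    matrix = [[0.0] * size for _ in range(size)]
    if not bias_words:
        # Nothing to bias toward: fall back to the identity matrix.
        for i in range(size):
            matrix[i][i] = 1.0
        return vocab, matrix
    inverse_freq = {w: 1.0 / counts[w] for w in bias_words}
    total = sum(inverse_freq.values())
    for i in range(size):
        matrix[i][i] = 1.0 - alpha
        for w in bias_words:
            matrix[i][index[w]] += alpha * inverse_freq[w] / total
    return vocab, matrix


def apply_bias(probs, matrix):
    """Return probs @ matrix, i.e. the biased output distribution."""
    size = len(probs)
    return [sum(probs[i] * matrix[i][j] for i in range(size)) for j in range(size)]


# Toy data, invented purely for illustration.
transcripts = [
    "the patient takes aspirin daily",
    "the patient reports pain",
    "the doctor prescribes metformin",
]
vocab, matrix = build_transform_matrix(transcripts, ["metformin"], alpha=0.2)
uniform = [1.0 / len(vocab)] * len(vocab)
biased = apply_bias(uniform, matrix)
target = vocab.index("metformin")
print(f"before: {uniform[target]:.2f}, after: {biased[target]:.2f}")
```

Because every row of the matrix sums to one, the biased output is still a valid probability distribution; only the share assigned to biasing-list words grows. A real system would likely operate on subword tokens and condition on more than unigram statistics.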