End-to-end (E2E) automatic speech recognition (ASR) methods exhibit remarkable performance. However, because their performance is intrinsically linked to the context present in the training data, E2E-ASR methods do not perform as desired for unseen user contexts (e.g., technical terms, personal names, and playlists). Thus, E2E-ASR methods must be easy for the user or developer to contextualize. This paper proposes an attention-based contextual biasing method that can be customized using an editable phrase list (referred to as a bias list). The proposed method can be trained effectively by combining a bias phrase index loss and special tokens to detect the bias phrases in the input speech data. In addition, to further improve contextualization performance during inference, we propose a bias phrase boosted (BPB) beam search algorithm based on the bias phrase index probability. Experimental results demonstrate that, for the target phrases in the bias list, the proposed method consistently improves the word error rate on the Librispeech-960 (English) dataset and the character error rate on our in-house (Japanese) dataset.
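To make the idea of boosting beam search scores with a bias phrase index probability more concrete, the following is a minimal toy sketch, not the algorithm from the paper. It assumes that a model emits, at each decoding step, a probability vector over bias list indices (with index 0 meaning "no bias phrase"), and that the score of any token that would start or continue a bias phrase is raised in proportion to that probability. The function names `next_bias_tokens` and `boost_scores`, the additive form of the boost, and the `weight` parameter are all hypothetical choices made for illustration.

```python
import math


def next_bias_tokens(prefix, phrase):
    """Return tokens that would start `phrase` or extend a partial match
    of `phrase` found at the end of the hypothesis `prefix`."""
    candidates = set()
    # k = number of leading phrase tokens matched by the tail of the prefix
    for k in range(min(len(prefix), len(phrase) - 1) + 1):
        if k == 0 or prefix[-k:] == phrase[:k]:
            candidates.add(phrase[k])
    return candidates


def boost_scores(log_probs, prefix, bias_phrases, index_probs, weight=1.0):
    """Add a bonus to the log-probability of tokens that continue a bias phrase.

    Assumption: index_probs[0] is the probability of "no bias phrase" and
    index_probs[i + 1] is the probability that bias_phrases[i] is being spoken.
    """
    boosted = dict(log_probs)
    for i, phrase in enumerate(bias_phrases):
        bonus = weight * index_probs[i + 1]
        for token in next_bias_tokens(prefix, phrase):
            if token in boosted:
                boosted[token] += bonus
    return boosted


# Toy example: the acoustic evidence slightly prefers "yolk" over "york".
log_probs = {"new": math.log(0.5), "york": math.log(0.2), "yolk": math.log(0.3)}
prefix = ["in", "new"]                # hypothesis decoded so far
bias_phrases = [["new", "york"]]      # editable bias list
index_probs = [0.1, 0.9]              # hypothetical bias phrase index probabilities

boosted = boost_scores(log_probs, prefix, bias_phrases, index_probs)
best = max(boosted, key=lambda t: boosted[t] if t != "new" else float("-inf"))
```

In this toy setting, the boosted score of "york" overtakes "yolk" because the hypothesis already ends with "new" and the bias phrase index probability is high. Editing `bias_phrases` changes which tokens receive the bonus without retraining anything, which is the kind of customization a bias list is meant to allow.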