In this paper, we propose a Guided Attention (GA) auxiliary training loss that improves the effectiveness and robustness of contextual biasing for automatic speech recognition (ASR) without introducing additional parameters. A common challenge in previous work is that the word error rate (WER) reduction brought by contextual biasing diminishes as the number of bias phrases increases. To address this challenge, we use the GA loss as an additional training objective alongside the Transducer loss. The proposed GA loss teaches the cross attention how to align bias phrases with text tokens or audio frames. Compared to studies with similar motivations, the proposed loss operates directly on the cross attention weights and is easier to implement. Through extensive experiments based on a Conformer Transducer with a Contextual Adapter, we demonstrate that the proposed method not only achieves a lower WER but also retains its effectiveness as the number of bias phrases increases. Specifically, the GA loss decreases the WER on rare words by up to 19.2% on LibriSpeech compared to the contextual biasing baseline, and by up to 49.3% compared to a vanilla Transducer.
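To make the idea of a loss that "operates directly on the cross attention weights" concrete, the following is a minimal sketch and not the paper's implementation. It assumes that the cross attention yields, for each text token or audio frame, a normalized distribution over the bias list, and that a target index per step is available. It also assumes that index 0 is a "no bias" entry, that the loss is a mean negative log-likelihood, and that the two objectives are combined by a weighted sum. The names `guided_attention_loss`, `total_loss`, and `ga_weight` are hypothetical.

```python
import math


def guided_attention_loss(attn_weights, targets, eps=1e-8):
    """Mean negative log of the attention mass on each step's target entry.

    attn_weights: list of rows, one per text token or audio frame; each row
        is a probability distribution over the bias list (assumed layout).
    targets: for each row, the index of the bias phrase that step belongs
        to, with index 0 assumed to mean "no bias".
    """
    if len(attn_weights) != len(targets):
        raise ValueError("attn_weights and targets must have the same length")
    if not targets:
        raise ValueError("at least one step is required")
    total = 0.0
    for row, tgt in zip(attn_weights, targets):
        # eps guards against log(0) when no mass is placed on the target.
        total += -math.log(row[tgt] + eps)
    return total / len(targets)


def total_loss(transducer_loss, ga_loss, ga_weight=1.0):
    """Assumed combination: Transducer loss plus a weighted GA term."""
    return transducer_loss + ga_weight * ga_loss


if __name__ == "__main__":
    # Three steps attending over ["no bias", phrase 1, phrase 2].
    attn = [
        [0.7, 0.2, 0.1],  # step 0 should attend to "no bias"
        [0.1, 0.8, 0.1],  # step 1 belongs to phrase 1
        [0.2, 0.2, 0.6],  # step 2 belongs to phrase 2
    ]
    ga = guided_attention_loss(attn, targets=[0, 1, 2])
    print(total_loss(transducer_loss=2.0, ga_loss=ga, ga_weight=0.5))
```

Because the term is computed from attention weights the model already produces, a loss of this shape adds no trainable parameters, which is consistent with the claim in the abstract. The exact target construction and weighting used in the paper may differ.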