Deep biasing (DB) improves the performance of end-to-end automatic speech recognition (E2E-ASR) on rare words and contextual phrases by using a bias list. However, most existing methods treat bias phrases as sequences of subwords from a predefined static vocabulary, which can lead to ineffective learning of the dependencies between subwords. More advanced techniques address this problem by incorporating additional text data, which increases the overall workload. This paper proposes a dynamic vocabulary to which phrase-level bias tokens can be added during inference. Each bias token represents an entire bias phrase as a single token, eliminating the need to learn the dependencies between the subwords within a bias phrase. Because the method only extends the embedding and output layers found in common E2E-ASR architectures, it can be applied to a variety of architectures. Experimental results show that the proposed method improves recognition of bias phrases on English and Japanese datasets.
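The dynamic vocabulary idea can be illustrated with a minimal sketch. It is not the paper's implementation: the toy subword vocabulary, the helper names (`build_dynamic_vocab`, `encode`, `extend_scores`), the `<phrase>` naming of bias tokens, and the dot-product scoring of bias tokens are all assumptions made for illustration. The sketch shows how bias phrases supplied at inference time each receive one new token index after the static entries, and how the output scores are extended by one entry per bias token.

```python
from typing import Dict, List, Sequence

# Toy static subword vocabulary (hypothetical; a real system would use a trained subword model).
STATIC_VOCAB: List[str] = ["<blank>", "play", "a", "song", "by", "ni", "el", "sen"]


def bias_token(phrase: Sequence[str]) -> str:
    """Name of the single phrase-level token standing for a whole bias phrase."""
    return "<" + "".join(phrase) + ">"


def build_dynamic_vocab(static_vocab: Sequence[str],
                        bias_phrases: Sequence[Sequence[str]]) -> Dict[str, int]:
    """Append one phrase-level token per non-empty bias phrase after the static entries."""
    vocab = {tok: i for i, tok in enumerate(static_vocab)}
    for phrase in bias_phrases:
        if phrase:
            vocab.setdefault(bias_token(phrase), len(vocab))
    return vocab


def encode(subwords: Sequence[str],
           bias_phrases: Sequence[Sequence[str]],
           vocab: Dict[str, int]) -> List[int]:
    """Map subwords to ids, collapsing each matched bias phrase into its single token."""
    # Try longer phrases first so the longest match wins; skip empty phrases.
    ordered = sorted((p for p in bias_phrases if p), key=len, reverse=True)
    ids: List[int] = []
    i = 0
    while i < len(subwords):
        for phrase in ordered:
            n = len(phrase)
            if list(subwords[i:i + n]) == list(phrase):
                ids.append(vocab[bias_token(phrase)])
                i += n
                break
        else:
            # No bias phrase starts here: fall back to the static subword id.
            ids.append(vocab[subwords[i]])
            i += 1
    return ids


def extend_scores(static_scores: Sequence[float],
                  hidden: Sequence[float],
                  phrase_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate static scores with one dot-product score per bias token (assumed scoring)."""
    bias_scores = [sum(h * e for h, e in zip(hidden, emb)) for emb in phrase_embeddings]
    return list(static_scores) + bias_scores


bias_list = [["ni", "el", "sen"]]
vocab = build_dynamic_vocab(STATIC_VOCAB, bias_list)
ids = encode(["play", "a", "song", "by", "ni", "el", "sen"], bias_list, vocab)
print(ids)  # [1, 2, 3, 4, 8]
```

In this toy setup the three subwords of the bias phrase collapse into the single index 8, so a decoder never has to model the dependencies among them. Changing the bias list only changes the entries appended after the static vocabulary.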