End-to-end multilingual speech recognition models handle multiple languages with a single model and often incorporate language identification to detect the language of incoming speech automatically. Since the language is already known in many practical scenarios, these models can operate in a language-specific fashion by using the language information as a prompt, which is particularly beneficial for attention-based encoder-decoder architectures. However, the Connectionist Temporal Classification (CTC) approach, which improves recognition through joint decoding and multi-task training, cannot normally incorporate language prompts because its output tokens are conditionally independent. To overcome this limitation, we introduce an encoder prompting technique within the self-conditioned CTC framework, enabling language-specific adaptation of the CTC model in a zero-shot manner. Our method is shown to reduce errors by 28% on average, and by 41% on low-resource languages.
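To make the mechanism concrete, the following is a minimal NumPy sketch of the two ideas the abstract combines: self-conditioned CTC (intermediate CTC posteriors fed back into later encoder layers) and encoder prompting (a language embedding added to the encoder input so the CTC branch adapts to a known language). All dimensions, weight matrices, and names here are illustrative stand-ins for trained parameters, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# Toy dimensions; all weights are random stand-ins for trained parameters.
DIM, VOCAB, N_LAYERS, N_LANGS, FRAMES = 16, 8, 3, 4, 20
W_layer = [rng.standard_normal((DIM, DIM)) * 0.1 for _ in range(N_LAYERS)]
W_ctc = rng.standard_normal((DIM, VOCAB)) * 0.1     # shared CTC output head
W_fb = rng.standard_normal((VOCAB, DIM)) * 0.1      # posterior -> feature feedback
lang_table = rng.standard_normal((N_LANGS, DIM)) * 0.1  # language prompt embeddings

def encode(frames, lang_id):
    """Frame-wise CTC posteriors from a prompted, self-conditioned encoder."""
    # Encoder prompting: add the language embedding to every input frame.
    x = frames + lang_table[lang_id]
    for i, W in enumerate(W_layer):
        x = np.tanh(x @ W)                  # stand-in for one encoder layer
        if i < N_LAYERS - 1:                # self-conditioned CTC feedback
            post = softmax(x @ W_ctc)       # intermediate CTC posteriors
            x = x + post @ W_fb             # condition later layers on them
    return softmax(x @ W_ctc)               # final frame-wise CTC posteriors

feats = rng.standard_normal((FRAMES, DIM))  # (frames, feature dim)
post = encode(feats, lang_id=2)             # prompt with a known language id
print(post.shape)                           # (20, 8)
```

Because the prompt enters through the encoder states rather than the decoder, the conditionally independent CTC outputs still benefit from the known-language information, which is the point of prompting the encoder instead of the (absent) autoregressive decoder.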