This study explores how bilingual fine-tuning affects automatic speech recognition (ASR) in low-resource languages. We evaluate the method on nine linguistically and geographically diverse language pairs that cover a range of language families and writing systems. To distinguish the two languages during training, we prepend a language identification token to each input text. At inference, the model jointly predicts both the language and the transcription from the speech input alone. Because ASR performance is low for texts whose language is misidentified, we also conduct a follow-up experiment in which the language identification token is provided during both training and inference. Our results show that bilingual fine-tuning can be beneficial when language identification accuracy is high, and that when language identification accuracy is low, providing the language identification token at inference helps to improve ASR performance.
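The language-token setup described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token format `<|xx|>`, the helper names, and the reading of "input text" as the target transcript are all assumptions made for illustration.

```python
import re
from typing import Optional, Tuple

# Hypothetical token format; the source does not specify the token strings.
LANG_TOKEN = re.compile(r"^<\|([a-z]{2,3})\|>\s*")


def add_lang_token(transcript: str, lang: str) -> str:
    """Prepend a language identification token to a training transcript."""
    return f"<|{lang}|> {transcript}"


def split_prediction(decoded: str) -> Tuple[Optional[str], str]:
    """Split decoded model output into (predicted language, transcription).

    Returns None as the language if no token is found at the start.
    """
    match = LANG_TOKEN.match(decoded)
    if match is None:
        return None, decoded.strip()
    return match.group(1), decoded[match.end():].strip()


def decoder_prefix(lang: Optional[str]) -> str:
    """Text prefix to force at inference (assumed mechanism).

    Empty when the model must predict the language itself; the language
    token when the language is provided, as in the follow-up experiment.
    """
    return "" if lang is None else f"<|{lang}|> "
```

In this sketch, the main experiment corresponds to `decoder_prefix(None)`, where the model emits the token itself and `split_prediction` recovers it, while the follow-up experiment corresponds to passing the known language code.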