Training a semi-supervised end-to-end speech recognition system with noisy student training significantly improves performance. However, this approach requires substantial amounts of paired speech-text data and unlabeled speech, which are costly to obtain for low-resource languages. This paper therefore considers a more extreme semi-supervised end-to-end automatic speech recognition setting: limited paired speech-text data, less than five hours of unlabeled speech, and abundant external text. First, we observe improved performance when the model is trained with our previous semi-supervised learning method, "CycleGAN and inter-domain losses," using external text alone. Second, we extend "CycleGAN and inter-domain losses" with automatic hyperparameter tuning, calling the result "enhanced CycleGAN inter-domain losses." Third, we integrate it into the noisy student training pipeline for low-resource scenarios. Experiments on six non-English languages from Voxforge and Common Voice show a 20% word error rate reduction over the baseline teacher model and a 10% word error rate reduction over the best baseline student model, demonstrating the significant improvements achieved by the proposed method.