ASR models are commonly trained with the cross-entropy criterion to increase the probability of a target token sequence. While optimizing the probability of all tokens in the target sequence is sensible, one may want to de-emphasize tokens that reflect transcription errors. In this work, we propose a novel token-weighted RNN-T criterion that augments the RNN-T objective with token-specific weights. The new objective is used for mitigating accuracy loss from transcription errors in the training data, which naturally appear in two settings: pseudo-labeling and human annotation errors. Experimental results show that using our method for semi-supervised learning with pseudo-labels leads to a consistent accuracy improvement, up to 38% relative. We also analyze the accuracy degradation resulting from different levels of WER in the reference transcription, and show that token-weighted RNN-T is suitable for overcoming this degradation, recovering 64%-99% of the accuracy loss.
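To illustrate the token-weighting idea in its simplest form, the sketch below applies per-token weights to a cross-entropy loss rather than to the full RNN-T lattice computation used in the paper; the function name, the weighting scheme, and the normalization by the weight sum are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def token_weighted_nll(log_probs, targets, weights):
    """Token-weighted negative log-likelihood (simplified illustration).

    log_probs: (T, V) array of per-step log-probabilities over a vocabulary.
    targets:   (T,) array of target token ids.
    weights:   (T,) array of per-token weights; lower weights de-emphasize
               tokens suspected to be transcription errors (e.g. low-confidence
               pseudo-labels). Normalizing by the weight sum is one possible
               choice, assumed here for illustration.
    """
    per_token = -log_probs[np.arange(len(targets)), targets]
    return float(np.sum(weights * per_token) / np.sum(weights))

# Toy example: three steps, vocabulary of size 3.
log_probs = np.log(np.array([[0.7, 0.2, 0.1],
                             [0.1, 0.6, 0.3],
                             [0.2, 0.2, 0.6]]))
targets = np.array([0, 1, 2])

# Uniform weights recover the standard (averaged) cross-entropy loss.
uniform_loss = token_weighted_nll(log_probs, targets, np.array([1.0, 1.0, 1.0]))

# Down-weighting the second token reduces its influence on the objective.
weighted_loss = token_weighted_nll(log_probs, targets, np.array([1.0, 0.3, 1.0]))
```

In a semi-supervised setting, the weights could for instance be derived from the teacher model's per-token confidence on the pseudo-labels.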