RNN-T is currently considered the industry standard in automatic speech recognition (ASR), owing to its strong word error rates (WERs) on a range of benchmarks and its support for seamless streaming and long-form transcription. Its biggest drawback, however, is the significant discrepancy between its training and inference objectives. During training, RNN-T maximizes the total probability of all alignments under teacher forcing, whereas during inference it relies on beam search, which is not guaranteed to find the most probable alignment. In addition, because teacher forcing never exposes the model to its own mistakes during training, an error made at inference time is harder for the model to recover from. To address this issue, this paper proposes a reinforcement learning method that reduces the gap between training and inference. Our Edit Distance based RL (EDRL) approach computes rewards from the edit distance and trains the network at the level of each individual action. The proposed approach yielded SoTA WERs on LibriSpeech for the 600M Conformer RNN-T model.
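To make the idea of an edit-distance reward assigned at every action more concrete, the following is a minimal sketch and not the EDRL algorithm itself, whose details are not given here. It assumes word-level tokens and assumes that each emitted token is scored by how much it increases the smallest edit distance between the hypothesis prefix and any prefix of the reference, with a terminal penalty for reference words left unmatched. The function names `edit_distance_table` and `per_token_rewards` are hypothetical.

```python
from typing import List, Sequence


def edit_distance_table(hyp: Sequence[str], ref: Sequence[str]) -> List[List[int]]:
    """Return D where D[i][j] is the Levenshtein distance between hyp[:i] and ref[:j]."""
    rows, cols = len(hyp) + 1, len(ref) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # delete all i hypothesis tokens
    for j in range(cols):
        table[0][j] = j  # insert all j reference tokens
    for i in range(1, rows):
        for j in range(1, cols):
            substitution = table[i - 1][j - 1] + (hyp[i - 1] != ref[j - 1])
            table[i][j] = min(substitution, table[i - 1][j] + 1, table[i][j - 1] + 1)
    return table


def per_token_rewards(hyp: Sequence[str], ref: Sequence[str]) -> List[float]:
    """Assumed reward scheme (illustrative only): one reward per emitted token.

    best[t] is the smallest distance between hyp[:t] and any reference prefix.
    Token t is rewarded with best[t-1] - best[t], i.e. 0 if it adds no new
    error and -1 if it does. The last token also absorbs the cost of any
    reference words that were never emitted.
    """
    table = edit_distance_table(hyp, ref)
    best = [min(row) for row in table]
    rewards = [float(best[t - 1] - best[t]) for t in range(1, len(hyp) + 1)]
    if rewards:
        rewards[-1] -= table[-1][-1] - best[-1]
    return rewards


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sat on mat".split()
    print(per_token_rewards(hypothesis, reference))
```

Under this assumed scheme, the rewards of a non-empty hypothesis sum to the negative of its total edit distance against the reference, so the sequence-level objective is preserved while each action receives its own credit.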