RNN-T is currently considered the industry standard in automatic speech recognition (ASR), owing to its strong word error rates (WERs) on a range of benchmarks and its support for seamless streaming and long-form transcription. Its biggest drawback, however, is a significant discrepancy between its training and inference objectives. During training, RNN-T maximizes the total probability over all alignments under teacher forcing, whereas during inference it relies on beam search, which is not guaranteed to find the most probable alignment. Moreover, because the model never encounters its own mistakes during teacher-forced training, it is poorly prepared to recover when a mistake occurs at inference time. To address this issue, this paper proposes a reinforcement learning method that narrows the gap between training and inference. Our Edit Distance based RL (EDRL) approach computes rewards from the edit distance and trains the network at the level of every action. The proposed approach yields SoTA WERs on LibriSpeech for the 600M-parameter Conformer RNN-T model.
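To make the quantity behind the reward concrete, the following is a minimal sketch of the word-level edit distance (Levenshtein distance with unit costs) and the WER derived from it. The function names and the `sequence_reward` helper are hypothetical and only illustrative: the abstract states that rewards are based on the edit distance and applied at every action, but does not specify the exact reward formula, so the negative-distance reward below is an assumption and not the paper's definition.

```python
from typing import Sequence


def edit_distance(hyp: Sequence[str], ref: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences, unit cost per edit."""
    # prev[j] = distance between the hyp prefix processed so far and ref[:j].
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]  # i hyp tokens vs. an empty reference: i deletions
        for j, r in enumerate(ref, start=1):
            substitution = prev[j - 1] + (h != r)  # 0 if the tokens match
            deletion = prev[j] + 1                 # drop h from the hypothesis
            insertion = curr[j - 1] + 1            # add r to the hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_error_rate(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """WER = edit distance divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(hyp, ref) / len(ref)


def sequence_reward(hyp: Sequence[str], ref: Sequence[str]) -> float:
    """Hypothetical reward (an assumption, not taken from the paper):
    fewer edits give a higher reward."""
    return -float(edit_distance(hyp, ref))


if __name__ == "__main__":
    ref = "the cat sat on the mat".split()
    hyp = "the cat sat on mat".split()  # one reference word is missing
    print(edit_distance(hyp, ref))      # 1
    print(word_error_rate(hyp, ref))
    print(sequence_reward(hyp, ref))    # -1.0
```

A sequence-level reward like this only scores a complete hypothesis; how EDRL distributes credit across individual actions is not described in the abstract and is left out of the sketch.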