Transducer neural networks have emerged as the mainstream approach for streaming automatic speech recognition (ASR), offering state-of-the-art trade-offs between accuracy and latency. In the conventional framework, streaming transducer models are trained to maximize a likelihood function derived from non-streaming recursion rules. However, this creates a mismatch between training and inference, yielding a deformed likelihood and, in turn, suboptimal ASR accuracy. We introduce a mathematical quantification of the gap between the actual likelihood and the deformed likelihood, which we term forward variable causal compensation (FoCC). We also present its estimator, FoCCE, as a means of estimating the exact likelihood. Through experiments on the LibriSpeech dataset, we show that FoCCE training improves the accuracy of streaming transducers.