A hybrid autoregressive transducer (HAT) is a variant of the neural transducer that models the blank and non-blank posterior distributions separately. In this paper, we propose a novel internal acoustic model (IAM) training strategy to enhance HAT-based speech recognition. The IAM consists of encoder and joint networks, which are fully shared and jointly trained with the HAT. This joint training not only improves HAT training efficiency but also encourages the IAM and HAT to emit blanks synchronously; synchronous blank emission skips the more expensive non-blank computation, enabling more effective blank thresholding for faster decoding. Experiments demonstrate that the relative error reductions achieved by the HAT with IAM over the vanilla HAT are statistically significant. Moreover, we introduce dual blank thresholding, which combines HAT- and IAM-blank thresholding, together with a compatible decoding algorithm. This yields a 42-75% decoding speed-up with no major performance degradation.
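The dual blank thresholding idea can be illustrated with a minimal greedy-decoding sketch. This is an assumption-laden illustration, not the paper's actual algorithm: the function name, the `nonblank_step` callback, and the threshold values are all hypothetical, and real HAT decoding operates on joint-network logits rather than precomputed per-frame blank posteriors. The sketch only shows the control flow: a frame is skipped (a blank is emitted) without running the expensive non-blank computation whenever both the HAT and the IAM blank posteriors exceed their thresholds.

```python
import math

def dual_blank_thresholding_decode(hat_blank_logp, iam_blank_logp,
                                   nonblank_step,
                                   beta_hat=math.log(0.5),
                                   beta_iam=math.log(0.5)):
    """Greedy frame-synchronous decoding sketch with dual blank thresholding.

    hat_blank_logp, iam_blank_logp: per-frame blank log-posteriors (lists).
    nonblank_step: hypothetical callback(t) -> token or None; stands in for
        the expensive non-blank posterior computation and label emission.
    beta_hat, beta_iam: hypothetical blank thresholds in the log domain.
    """
    hyp, skipped = [], 0
    for t in range(len(hat_blank_logp)):
        # If both models are confident the frame is blank, emit blank
        # and skip the expensive non-blank computation entirely.
        if hat_blank_logp[t] > beta_hat and iam_blank_logp[t] > beta_iam:
            skipped += 1
            continue
        tok = nonblank_step(t)
        if tok is not None:
            hyp.append(tok)
    return hyp, skipped
```

Because the skip condition requires agreement from both models, joint training that synchronizes blank emission between the IAM and the HAT directly increases the fraction of frames that can be skipped, which is where the reported decoding speed-up comes from.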