We present \textbf{H}ybrid-\textbf{A}utoregressive \textbf{IN}ference Tr\textbf{AN}sducers (HAINAN), a novel architecture for speech recognition that extends the Token-and-Duration Transducer (TDT) model. Trained with randomly masked predictor network outputs, HAINAN supports both autoregressive inference with all network components and non-autoregressive inference without the predictor. Additionally, we propose a novel semi-autoregressive inference paradigm that first generates an initial hypothesis using non-autoregressive inference, followed by refinement steps where each token prediction is regenerated using parallelized autoregression on the initial hypothesis. Experiments on multiple datasets across different languages demonstrate that HAINAN achieves efficiency parity with CTC in non-autoregressive mode and with TDT in autoregressive mode. In terms of accuracy, autoregressive HAINAN outperforms TDT and RNN-T, while non-autoregressive HAINAN significantly outperforms CTC. Semi-autoregressive inference further enhances the model's accuracy with minimal computational overhead, and even outperforms TDT results in some cases. These results highlight HAINAN's flexibility in balancing accuracy and speed, positioning it as a strong candidate for real-world speech recognition applications.
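The semi-autoregressive paradigm described above can be sketched as a simple refinement loop: a non-autoregressive pass produces an initial hypothesis, after which every token is regenerated in parallel, each conditioned on the previous hypothesis as its left context. The sketch below is illustrative only; `decode_nar` and `decode_token` are hypothetical placeholders, not the paper's API, and the real model operates on encoder frames rather than token lists.

```python
# Hedged sketch of semi-autoregressive inference, assuming two hypothetical
# decoding primitives: decode_nar (predictor-free, non-autoregressive) and
# decode_token (predicts one token given left context from the predictor).

def semi_autoregressive_decode(encoder_out, decode_nar, decode_token, n_refine=2):
    """Refine a non-autoregressive hypothesis via parallelized autoregression."""
    # Step 1: initial hypothesis without the predictor network.
    hyp = decode_nar(encoder_out)
    for _ in range(n_refine):
        # Step 2: regenerate every position, each conditioned on the *previous*
        # hypothesis as left context. The list comprehension models the fact
        # that all positions can be computed in parallel, since none depends
        # on another position's *new* prediction.
        new_hyp = [decode_token(encoder_out, hyp[:i]) for i in range(len(hyp))]
        if new_hyp == hyp:  # hypothesis converged; further passes are no-ops
            break
        hyp = new_hyp
    return hyp
```

Because each refinement pass conditions only on the frozen previous hypothesis, the per-pass cost is close to one non-autoregressive pass, which is consistent with the abstract's claim of minimal computational overhead.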