Non-autoregressive automatic speech recognition (NASR) models have gained attention for their parallelism and fast inference. Encoder-based NASR models, e.g. connectionist temporal classification (CTC), can be initialized from speech foundation models (SFMs) but do not account for dependencies among intermediate tokens. Encoder-decoder-based NASR models, such as the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by recent successes in speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR model, UniEnc-CASSNAT, that combines the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as its major module, which can be an SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes: the first pass takes the speech signal as input, while the second takes the concatenation of the speech signal and the token-level acoustic embedding. Evaluated on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and performs better than or comparably to CASS-NAT while using only an encoder and hence fewer model parameters. Our code is publicly available.
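The two-pass use of a single encoder can be illustrated with a toy sketch. This is not the released implementation: the encoder here is a hypothetical stand-in (one linear map with mean-mixing, random weights), the token-level acoustic embeddings are approximated by mean-pooling the frames of each token segment in a greedy CTC alignment, and the names `encoder`, `token_segments`, and `two_pass_decode` are illustrative assumptions only.

```python
import math
import random

DIM = 4      # hypothetical feature size
VOCAB = 5    # hypothetical vocabulary size; index 0 is the CTC blank
BLANK = 0

rng = random.Random(0)
W_ENC = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(DIM)]
W_OUT = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB)]


def matvec(w, v):
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def encoder(seq):
    """Toy stand-in for the shared encoder (assumption, not an SFM):
    each position is mixed with the sequence mean, then projected."""
    mean = [sum(col) / len(seq) for col in zip(*seq)]
    return [
        [math.tanh(x) for x in matvec(W_ENC, [a + m for a, m in zip(v, mean)])]
        for v in seq
    ]


def argmax_tokens(hidden):
    """Project each hidden vector to the vocabulary and take the argmax."""
    logits = [matvec(W_OUT, h) for h in hidden]
    return [max(range(VOCAB), key=row.__getitem__) for row in logits]


def token_segments(alignment):
    """Group frame indices by token: consecutive identical non-blank
    labels form one segment, and a blank ends the current segment."""
    segments, prev = [], BLANK
    for t, label in enumerate(alignment):
        if label != BLANK:
            if label == prev:
                segments[-1].append(t)
            else:
                segments.append([t])
        prev = label
    return segments


def two_pass_decode(speech):
    # Pass 1: speech only, giving a frame-level (CTC-style) alignment.
    h1 = encoder(speech)
    segments = token_segments(argmax_tokens(h1))
    if not segments:
        return []
    # Assumed simplification: mean-pool the frames of each token segment.
    tae = [
        [sum(h1[t][d] for t in seg) / len(seg) for d in range(DIM)]
        for seg in segments
    ]
    # Pass 2: the same encoder on [speech ; token-level acoustic embeddings].
    h2 = encoder(speech + tae)
    # Read predictions at the token positions only, all in one step.
    return argmax_tokens(h2[len(speech):])


speech = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(12)]
print(two_pass_decode(speech))
```

The point of the sketch is structural: the same `encoder` weights serve both passes, and the second pass produces one output per token position in parallel rather than decoding token by token.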