UniEnc-CASSNAT: An Encoder-only Non-autoregressive ASR for Speech SSL Models

Non-autoregressive automatic speech recognition (NASR) models have gained attention due to their parallelism and fast inference. Encoder-based NASR, e.g., connectionist temporal classification (CTC), can be initialized from a speech foundation model (SFM) but does not account for any dependencies among intermediate tokens. Encoder-decoder-based NASR, such as the CTC alignment-based single-step non-autoregressive transformer (CASS-NAT), can mitigate the dependency problem but cannot efficiently integrate an SFM. Inspired by the recent success of speech-text joint pre-training with a shared transformer encoder, we propose a new encoder-based NASR, UniEnc-CASSNAT, to combine the advantages of CTC and CASS-NAT. UniEnc-CASSNAT consists of only an encoder as the major module, which can be the SFM. The encoder plays the role of both the CASS-NAT encoder and decoder through two forward passes. The first pass takes the speech signal as input, while the second pass takes the concatenation of the speech signal and the token-level acoustic embedding. Evaluated on the Librispeech 100h, MyST, and Aishell1 datasets, the proposed UniEnc-CASSNAT achieves state-of-the-art NASR results and performs better than or comparably to CASS-NAT while using only an encoder and hence fewer model parameters. Our code is publicly available.
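The two-pass data flow described above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_encoder`, `token_acoustic_embeddings`, and `two_pass_forward` are hypothetical names, the encoder is a trivial stand-in for an SFM, the token segments are assumed to be given (in CASS-NAT-style models they would come from a CTC alignment), averaging frames within a segment is an assumed extraction rule, and concatenation is assumed to be along the time axis.

```python
from typing import Callable, List, Tuple

Frame = List[float]
Sequence = List[Frame]


def toy_encoder(inputs: Sequence) -> Sequence:
    """Hypothetical stand-in for the shared encoder (e.g. an SFM).

    Adds the sequence-wide mean to every position, so each output depends
    on the whole input, loosely mimicking self-attention.
    """
    dim = len(inputs[0])
    mean = [sum(vec[d] for vec in inputs) / len(inputs) for d in range(dim)]
    return [[vec[d] + mean[d] for d in range(dim)] for vec in inputs]


def token_acoustic_embeddings(
    encoded: Sequence, segments: List[Tuple[int, int]]
) -> Sequence:
    """Average the encoded frames inside each token segment [start, end).

    Assumption: one embedding per token, taken as the mean of its frames.
    """
    dim = len(encoded[0])
    embeddings = []
    for start, end in segments:
        span = encoded[start:end]
        embeddings.append(
            [sum(vec[d] for vec in span) / len(span) for d in range(dim)]
        )
    return embeddings


def two_pass_forward(
    speech: Sequence,
    segments: List[Tuple[int, int]],
    encoder: Callable[[Sequence], Sequence],
) -> Tuple[Sequence, Sequence]:
    """Run the same encoder twice, as encoder and then as decoder."""
    # Pass 1: speech only.
    first_pass = encoder(speech)
    # Token-level acoustic embeddings derived from the first pass.
    tae = token_acoustic_embeddings(first_pass, segments)
    # Pass 2: speech and token-level embeddings concatenated along time
    # (assumed axis), processed by the same encoder.
    second_pass = encoder(speech + tae)
    # The positions after the speech frames correspond to the tokens.
    token_outputs = second_pass[len(speech):]
    return tae, token_outputs


# Toy input: 4 speech frames of dimension 2, split into 2 token segments.
speech = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 4.0]]
segments = [(0, 2), (2, 4)]
tae, token_outputs = two_pass_forward(speech, segments, toy_encoder)
```

Because both passes call the same `encoder`, no separate decoder parameters are needed, which is the source of the parameter savings claimed relative to CASS-NAT.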
