Non-autoregressive automatic speech recognition (ASR) has become a mainstream approach to ASR modeling because of its fast decoding speed and satisfactory results. To further boost performance, relaxing the conditional independence assumption and cascading large-scale pre-trained models are two active research directions. In addition to these strategies, we propose a lexical-aware non-autoregressive Transformer-based (LA-NAT) ASR framework, which consists of an acoustic encoder, a speech-text shared encoder, and a speech-text shared decoder. The acoustic encoder processes the input speech features as usual, while the speech-text shared encoder and decoder are designed to be trained on speech and text data simultaneously. In this way, LA-NAT aims to make the ASR model aware of lexical information, so that the resulting model can achieve better results by leveraging the learned linguistic knowledge. A series of experiments is conducted on the AISHELL-1, CSJ, and TEDLIUM 2 datasets. According to the experiments, the proposed LA-NAT achieves better results than other recently proposed non-autoregressive ASR models. In addition, LA-NAT is more compact than most non-autoregressive ASR models, and it decodes about 58 times faster than the classic autoregressive model.
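The data flow described above can be illustrated with a toy sketch. It is not the authors' implementation: the class name `ToyLANAT`, the hidden size `DIM`, the use of random linear maps in place of Transformer blocks, and the token embedding that feeds text into the shared encoder are all assumptions made for illustration. The sketch only shows that speech passes through the acoustic encoder first, that text bypasses it, and that both then share the same encoder and decoder weights, with all output positions predicted in one parallel pass.

```python
import random

DIM = 4  # hypothetical hidden size, chosen only for this toy example


def linear(rng, n_in, n_out):
    """Random weight matrix (list of rows); a toy stand-in for a Transformer block."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]


def apply(weights, vectors):
    """Apply the same linear map to every frame or token vector independently."""
    return [[sum(w * x for w, x in zip(row, vec)) for row in weights] for vec in vectors]


class ToyLANAT:
    """Illustrative routing only: speech-specific front end, shared encoder/decoder."""

    def __init__(self, feat_dim, vocab_size, seed=0):
        rng = random.Random(seed)
        self.acoustic_encoder = linear(rng, feat_dim, DIM)    # used by speech only
        # Assumption: text enters the shared encoder through an embedding table.
        self.text_embedding = linear(rng, DIM, vocab_size)    # one DIM-sized row per token
        self.shared_encoder = linear(rng, DIM, DIM)           # used by speech and text
        self.shared_decoder = linear(rng, DIM, vocab_size)    # used by speech and text

    def _shared(self, hidden):
        logits = apply(self.shared_decoder, apply(self.shared_encoder, hidden))
        # Non-autoregressive decoding: each position takes its argmax independently,
        # with no dependence on previously emitted tokens.
        return [max(range(len(row)), key=row.__getitem__) for row in logits]

    def forward_speech(self, features):
        # Simplification: one output token per input frame, with no length reduction.
        return self._shared(apply(self.acoustic_encoder, features))

    def forward_text(self, token_ids):
        return self._shared([self.text_embedding[t] for t in token_ids])


model = ToyLANAT(feat_dim=3, vocab_size=5)
speech_tokens = model.forward_speech([[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]])
text_tokens = model.forward_text([1, 2, 3])
print(speech_tokens, text_tokens)
```

Because `shared_encoder` and `shared_decoder` are the same objects on both paths, any training signal from text-only data would update the weights that speech decoding also uses, which is the intuition behind making the model lexical-aware.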