In this work, we propose a streaming audio-visual automatic speech recognition (AV-ASR) system based on a hybrid connectionist temporal classification (CTC)/attention neural network architecture. The audio and visual encoder neural networks are both based on the conformer architecture, which is made streamable using chunk-wise self-attention (CSA) and causal convolution. Streaming recognition with a decoder neural network is realized using the triggered attention technique, which performs time-synchronous decoding with joint CTC/attention scoring. Additionally, we propose a novel alignment regularization technique that promotes synchronization of the audio and visual encoders, which in turn yields lower word error rates (WERs) at all signal-to-noise ratio (SNR) levels for both streaming and offline AV-ASR models. The proposed AV-ASR model achieves WERs of 2.0% and 2.6% on the Lip Reading Sentences 3 (LRS3) dataset in offline and online setups, respectively, both of which represent state-of-the-art results when no external training data are used.
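To illustrate how chunk-wise self-attention can make an encoder streamable, the following is a minimal sketch of a chunk-wise attention mask. It is not the implementation used in this work: the function name `chunk_attention_mask` and the parameters `chunk_size` and `left_chunks` are hypothetical, and the assumption that each frame attends to its own chunk plus a fixed number of preceding chunks is ours, since the abstract does not specify the context configuration.

```python
def chunk_attention_mask(num_frames, chunk_size, left_chunks=1):
    """Build a boolean mask for chunk-wise self-attention.

    mask[q][k] is True when query frame q may attend to key frame k.
    Assumption (not from the paper): a frame sees every frame in its own
    chunk and in the `left_chunks` chunks before it, and nothing later.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if left_chunks < 0:
        raise ValueError("left_chunks must be non-negative")

    mask = []
    for q in range(num_frames):
        q_chunk = q // chunk_size  # index of the chunk holding the query
        row = []
        for k in range(num_frames):
            k_chunk = k // chunk_size  # index of the chunk holding the key
            # Allowed: key chunk is the query chunk or one of the
            # `left_chunks` chunks directly before it; never a later chunk.
            row.append(q_chunk - left_chunks <= k_chunk <= q_chunk)
        mask.append(row)
    return mask


if __name__ == "__main__":
    # Six frames, chunks of two frames, one chunk of left context.
    for row in chunk_attention_mask(6, 2, left_chunks=1):
        print("".join("x" if allowed else "." for allowed in row))
```

Because no frame attends beyond the end of its own chunk, the encoder output for a chunk can be computed as soon as that chunk has arrived, so the look-ahead latency in this sketch is bounded by `chunk_size` frames. Causal convolution serves the same purpose for the convolution module of the conformer, by restricting each kernel to current and past frames.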