Unified speech-text models such as SpeechGPT, VioLA, and AudioPaLM have shown impressive performance across various speech-related tasks, especially Automatic Speech Recognition (ASR). These models typically adopt a unified method to model discrete speech and text tokens, followed by training a decoder-only transformer. However, they are all designed for non-streaming ASR, where the entire speech utterance must be available during decoding. We therefore introduce a decoder-only model designed exclusively for streaming recognition, incorporating a dedicated boundary token to facilitate streaming decoding and employing causal attention masking during training. Furthermore, we introduce right-chunk attention and various data augmentation techniques to improve the model's contextual modeling ability. Experiments on the AISHELL-1 and AISHELL-2 datasets demonstrate that our approach, while operating in a streaming fashion, achieves performance competitive with non-streaming decoder-only counterparts.
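To make the masking idea concrete, the following is a minimal sketch of a chunk-wise attention mask in the spirit of right-chunk attention: each position attends to all past positions plus the rest of its own chunk, giving a bounded amount of right context while remaining streaming-compatible. The function name and chunk granularity are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def chunk_causal_mask(seq_len: int, chunk_size: int) -> np.ndarray:
    """Boolean attention mask (True = may attend).

    Each position i attends to every position up to the end of its own
    chunk, i.e. full causal history plus a limited right chunk.
    (Illustrative sketch; the chunking scheme is an assumption.)
    """
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for i in range(seq_len):
        # Last position (exclusive) visible to i: end of i's chunk.
        chunk_end = ((i // chunk_size) + 1) * chunk_size
        mask[i, :min(chunk_end, seq_len)] = True
    return mask

# With chunk_size=2 and seq_len=4, position 0 sees positions {0, 1},
# while position 2 sees all of {0, 1, 2, 3}.
m = chunk_causal_mask(4, 2)
```

A fully causal mask is the special case `chunk_size = 1`; larger chunks trade a small recognition delay for more right context per frame.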