There has been increasing interest in large speech models that can perform multiple tasks in a single model. Such models usually adopt an encoder-decoder or decoder-only architecture due to their popularity and good performance in many domains. However, autoregressive models can be slower during inference compared to non-autoregressive models and also carry a risk of hallucination. Although prior studies have observed promising results with non-autoregressive models on certain tasks at small scales, it remains unclear whether they can be scaled to speech-to-text generation across diverse languages and tasks. Inspired by the Open Whisper-style Speech Model (OWSM) project, we propose OWSM-CTC, a novel encoder-only speech foundation model based on Connectionist Temporal Classification (CTC). It is trained on 180k hours of public audio data for multilingual automatic speech recognition (ASR), speech translation (ST), and language identification (LID). Compared to encoder-decoder OWSM, our OWSM-CTC achieves competitive results on ASR and up to 24% relative improvement on ST, while being more robust and 3 to 4 times faster at inference. OWSM-CTC also improves the long-form ASR result with a 20x speed-up. We will publicly release our code, pre-trained model, and training logs to promote open science in speech foundation models.
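To make the non-autoregressive contrast concrete: a CTC model emits one label per encoder frame in a single forward pass, and the output sequence is recovered by collapsing repeated labels and removing blanks, with no step-by-step decoding loop. The sketch below shows this standard CTC greedy decoding rule; the vocabulary and frame outputs are illustrative, not from the paper.

```python
def ctc_greedy_decode(frame_ids, blank=0):
    """Collapse per-frame CTC outputs into a label sequence.

    Standard CTC rule: merge consecutive repeats, then drop blanks.
    All frames are processed in one pass (non-autoregressive).
    """
    decoded = []
    prev = None
    for label in frame_ids:
        if label != prev and label != blank:
            decoded.append(label)
        prev = label
    return decoded


# Illustrative example: blank=0, 1="a", 2="b".
# Per-frame argmax outputs from a hypothetical encoder:
frames = [0, 1, 1, 0, 2, 2, 2, 0, 1]
print(ctc_greedy_decode(frames))  # [1, 2, 1] -> "aba"
```

Because every frame's label is predicted independently given the encoder output, inference cost does not grow with output length, which is the source of the speed-up over autoregressive decoding reported above.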