The emergence of industrial-scale automatic speech recognition (ASR) models such as Whisper and USM, trained on 1M hours of weakly labelled and 12M hours of audio-only proprietary data respectively, has created a stronger need for large-scale public ASR corpora and competitive open-source pipelines. Unlike these models, large language models are typically based on Transformer decoders, and it remains unclear whether decoder-only models trained on public data alone can deliver competitive ASR performance. In this work, we investigate which factors, such as the choice of training datasets and modeling components, are necessary for obtaining the best performance using public English ASR corpora alone. Our Decoder-Only Transformer for ASR (DOTA) model outperforms the open-source encoder-decoder replication of Whisper (OWSM) on nearly all English ASR benchmarks and outperforms Whisper large-v3 on 7 out of 15 test sets. We release our codebase and model checkpoints under a permissive license.