Foundation models (FMs), which are trained on broad data at scale and are adaptable to a wide range of downstream tasks, have attracted great interest in the research community. Benefiting from diverse data sources spanning different modalities, languages, and application domains, foundation models have demonstrated strong generalization and knowledge transfer capabilities. In this paper, we present a pioneering study towards building an efficient solution for FM-based speech recognition systems. We adopt the recently developed self-supervised BEST-RQ for pretraining, and propose joint finetuning with both source and unsupervised target domain data using JUST Hydra. The FM encoder adapter and decoder are then finetuned to the target domain with a small amount of supervised in-domain data. On a large-scale YouTube and Voice Search task, our method is shown to be both data and model parameter efficient. It achieves the same quality with only 21.6M supervised in-domain data and 130.8M finetuned parameters, compared to the 731.1M-parameter model trained from scratch on an additional 300M supervised in-domain data.