Advances in machine learning have made it possible to perform various text and speech processing tasks, including automatic speech recognition (ASR), in an end-to-end (E2E) manner. Because typical E2E approaches require large amounts of training data and resources, leveraging pre-trained foundation models instead of training from scratch is gaining attention. Although there have been attempts to use pre-trained speech models and pre-trained language models in ASR, most are limited to using one or the other. This paper explores the potential of integrating a pre-trained speech representation model with a large language model (LLM) for E2E ASR. The proposed model performs E2E ASR by feeding speech representations to the LLM as speech prompts and generating text tokens autoregressively, taking advantage of the vast knowledge provided by the LLM. Furthermore, the proposed model can incorporate notable advances in LLM utilization, such as inference optimization and parameter-efficient domain adaptation. Experimental results show that the proposed model achieves performance comparable to modern E2E ASR models.
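The speech-prompt idea can be illustrated with a toy sketch. It is not the paper's implementation: the frame stacking, the bias-free linear bridge, the dot-product scorer standing in for the LLM, and every name and dimension below are assumptions made only for illustration. It shows speech features being mapped into the language model's embedding space, used as a prefix, and followed by greedy autoregressive decoding.

```python
import random


def stack_frames(frames, k):
    """Downsample by concatenating every k consecutive frames (remainder dropped)."""
    return [sum(frames[i:i + k], []) for i in range(0, len(frames) - k + 1, k)]


def linear(x, weight):
    """Bias-free linear map; weight holds one row per output dimension."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def toy_logits(context, embed):
    """Hypothetical stand-in for the LLM: dot product of each token
    embedding with the mean of all context vectors."""
    dim = len(context[0])
    mean = [sum(vec[d] for vec in context) / len(context) for d in range(dim)]
    return [sum(m * e for m, e in zip(mean, row)) for row in embed]


def greedy_decode(prompt, embed, next_token_logits, eos_id, max_len):
    """Generate tokens one at a time, conditioned on the prompt vectors."""
    context = list(prompt)          # speech prompt acts as the prefix
    tokens = []
    for _ in range(max_len):
        logits = next_token_logits(context)
        token = max(range(len(logits)), key=logits.__getitem__)
        if token == eos_id:         # stop at end-of-sequence
            break
        tokens.append(token)
        context.append(embed[token])  # feed the chosen token back in
    return tokens


# Assumed toy sizes: 7 speech frames of dimension 4, stacking factor 2,
# model dimension 3, vocabulary of 5 tokens (id 0 is end-of-sequence).
rng = random.Random(0)
feat_dim, k, model_dim, vocab = 4, 2, 3, 5
frames = [[rng.gauss(0, 1) for _ in range(feat_dim)] for _ in range(7)]
bridge = [[rng.gauss(0, 1) for _ in range(feat_dim * k)] for _ in range(model_dim)]
embed = [[rng.gauss(0, 1) for _ in range(model_dim)] for _ in range(vocab)]

# Speech representations -> vectors in the language model's embedding space.
prompt = [linear(x, bridge) for x in stack_frames(frames, k)]
tokens = greedy_decode(prompt, embed, lambda ctx: toy_logits(ctx, embed),
                       eos_id=0, max_len=10)
print(len(prompt), tokens)
```

With random weights the decoded token ids are meaningless; the sketch only shows the data flow. In a real system the random features would come from a pre-trained speech representation model and the scorer would be a pre-trained LLM.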