Large Language Models (LLMs) have demonstrated remarkable effectiveness across a wide range of NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building on this momentum, we conduct an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, we evaluate the impact of various configurations of speech encoders, LLMs, and projector modules within the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. By implementing this approach, together with the strategic integration of ASR components, we achieve state-of-the-art (SOTA) performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis provides an empirical foundation for future research on LLM-based ASR systems and offers insights into optimizing performance on Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.
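The speech foundation encoder-LLM paradigm described above can be illustrated with a minimal, framework-free sketch. All component names, dimensions, and the specific projector design below are hypothetical stand-ins, not the paper's implementation: the encoder maps a waveform to frame-level features, and the projector bridges the encoder's frame rate and dimension to the LLM's embedding space (here via simple frame stacking followed by a linear map, one common design choice for such projectors).

```python
# Conceptual sketch of the encoder-projector-LLM ASR pipeline.
# All components are toy stand-ins with random values (no real model weights);
# the projector stacks adjacent frames to reduce sequence length, then applies
# a linear map into the LLM embedding space.

import random

ENC_DIM, LLM_DIM, STACK = 4, 6, 2  # toy sizes; real models use hundreds to thousands

def speech_encoder(waveform):
    """Stand-in for a speech foundation encoder: waveform -> frame features."""
    n_frames = max(1, len(waveform) // 10)  # pretend 10 samples per frame
    return [[random.random() for _ in range(ENC_DIM)] for _ in range(n_frames)]

def projector(frames, weights):
    """Downsample by stacking STACK adjacent frames, then linearly project
    each stacked vector into the LLM's embedding dimension."""
    out = []
    for i in range(0, len(frames) - STACK + 1, STACK):
        stacked = [x for f in frames[i:i + STACK] for x in f]  # (STACK*ENC_DIM,)
        out.append([sum(w * x for w, x in zip(row, stacked)) for row in weights])
    return out  # sequence of LLM_DIM-dimensional "speech token" embeddings

# Toy linear projection: LLM_DIM rows, each of length STACK*ENC_DIM.
W = [[random.uniform(-1, 1) for _ in range(STACK * ENC_DIM)] for _ in range(LLM_DIM)]

frames = speech_encoder([0.0] * 100)  # 10 encoder frames
speech_embeds = projector(frames, W)  # 5 embeddings, halved sequence length
print(len(frames), len(speech_embeds), len(speech_embeds[0]))
```

In the full paradigm, `speech_embeds` would be concatenated with the embedded text prompt and fed to the LLM, which generates the transcript autoregressively; the three-stage training schedule then decides which of the encoder, projector, and LLM parameters are updated at each stage.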