Large language models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building on this momentum, we examine this paradigm in depth on a large open-source Chinese dataset. Specifically, we evaluate the impact of various configurations of speech encoders, LLMs, and projector modules within the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach designed to enhance the model's ability to align auditory and textual information. This approach, together with the strategic integration of ASR components, enabled us to achieve state-of-the-art (SOTA) performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis provides an empirical foundation for future research on LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs, to promote reproducible research.