The INTERSPEECH 2025 Challenge on Multilingual Conversational Speech Language Models (MLC-SLM) promotes multilingual conversational ASR with large language models (LLMs). Our previous SHNU-mASR system adopted a competitive parallel-speech-encoder architecture that integrated Whisper and mHuBERT with an LLM. However, it faced two challenges: simple feature concatenation may not fully exploit complementary information, and the performance gap between LLM-based ASR and end-to-end (E2E) encoder-decoder ASR remained unexplored. In this work, we present an enhanced LLM-based ASR framework that combines fine-tuned Whisper and mHuBERT encoders with an LLM to enrich speech representations. We first evaluate E2E Whisper models with LoRA and full fine-tuning on the MLC-SLM ASR task, and then propose cross-attention-based fusion mechanisms for the parallel-speech-encoder architecture. On the official evaluation set of the MLC-SLM Challenge, our system achieves a CER/WER of 10.69%, on par with the top-ranked Track 1 systems, even though it uses only the 1,500 hours of baseline training data compared with their large-scale training sets. Nonetheless, we find that our final LLM-based ASR still does not match the performance of a fine-tuned E2E Whisper model, providing valuable empirical guidance for future Speech-LLM design. Our code is publicly available at https://github.com/1535176727/MLC-SLM.
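To make the cross-attention fusion concrete, below is a minimal PyTorch sketch of one plausible realization: Whisper frames serve as queries and mHuBERT frames as keys/values, so the fused sequence keeps Whisper's time axis. The feature dimensions, the residual-plus-LayerNorm design, and the query/key role assignment are illustrative assumptions, not the exact configuration of our released system (see the repository for the actual implementation).

```python
import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Fuse two parallel speech-encoder outputs with cross-attention.

    Hypothetical sketch: Whisper features are the queries, mHuBERT
    features the keys/values; dimensions below are assumed defaults.
    """

    def __init__(self, d_whisper=1280, d_mhubert=768, d_model=1024, n_heads=8):
        super().__init__()
        self.q_proj = nn.Linear(d_whisper, d_model)   # project Whisper stream
        self.kv_proj = nn.Linear(d_mhubert, d_model)  # project mHuBERT stream
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, whisper_feats, mhubert_feats):
        # whisper_feats: (B, T_w, d_whisper); mhubert_feats: (B, T_h, d_mhubert)
        q = self.q_proj(whisper_feats)
        kv = self.kv_proj(mhubert_feats)
        fused, _ = self.attn(q, kv, kv)   # each Whisper frame attends over mHuBERT
        return self.norm(q + fused)       # residual keeps the Whisper time axis


if __name__ == "__main__":
    fusion = CrossAttentionFusion()
    w = torch.randn(2, 1500, 1280)  # e.g., Whisper-large encoder frames
    h = torch.randn(2, 499, 768)    # e.g., mHuBERT frames at a different rate
    print(fusion(w, h).shape)       # torch.Size([2, 1500, 1024])
```

The fused sequence would then be projected into the LLM's embedding space, as in the parallel-speech-encoder pipeline the abstract describes; a symmetric variant (mHuBERT as queries) is equally possible under this sketch.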