The rapid progress of large language models (LLMs) has opened up a new frontier for automatic speech recognition (ASR), making their effective integration a critical and challenging research direction. To this end, this work proposes a projector-based LLM-ASR framework targeting the key challenges of multilingual generalization and modality alignment. Our approach incorporates a Mixture of Experts (MoE) architecture to improve cross-lingual adaptability, and a Continuous Integrate-and-Fire (CIF) mechanism for dynamic downsampling and modality alignment. Experimental results show that the combination of these components yields substantial performance improvements, surpassing strong baseline models. The proposed method represents a step toward building more accurate, robust, and generalizable LLM-based ASR systems.
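To make the dynamic downsampling idea concrete, the following is a minimal sketch of the generic CIF integrate-and-fire step, not the implementation used in this work. It assumes that per-frame weights (`alphas`) are already available, each no larger than the firing threshold; in a real system these weights would be predicted by a learned module. The function name `cif_downsample`, its parameters, and the threshold of 1.0 are illustrative assumptions.

```python
from typing import List, Sequence


def cif_downsample(
    frames: Sequence[Sequence[float]],
    alphas: Sequence[float],
    threshold: float = 1.0,
) -> List[List[float]]:
    """Illustrative CIF step (hypothetical helper, not the paper's code).

    Accumulates frame weights; each time the running sum reaches
    `threshold`, emits one integrated vector. Assumes every alpha is
    in [0, threshold], so a frame triggers at most one firing.
    """
    if len(frames) != len(alphas):
        raise ValueError("frames and alphas must have the same length")
    if any(a < 0.0 or a > threshold for a in alphas):
        raise ValueError("each alpha must lie in [0, threshold]")

    dim = len(frames[0]) if frames else 0
    outputs: List[List[float]] = []
    acc_weight = 0.0            # weight accumulated since the last firing
    acc_state = [0.0] * dim     # weighted sum of frames since the last firing

    for frame, alpha in zip(frames, alphas):
        if acc_weight + alpha < threshold:
            # Not enough weight yet: keep integrating.
            acc_weight += alpha
            acc_state = [s + alpha * x for s, x in zip(acc_state, frame)]
        else:
            # Fire: use only the part of alpha needed to reach the threshold.
            used = threshold - acc_weight
            outputs.append([s + used * x for s, x in zip(acc_state, frame)])
            # The leftover weight of this frame starts the next unit.
            rest = alpha - used
            acc_weight = rest
            acc_state = [rest * x for x in frame]

    # Any tail that never reaches the threshold is dropped in this sketch.
    return outputs


if __name__ == "__main__":
    feats = [[4.0], [8.0], [4.0]]
    weights = [0.75, 0.5, 0.75]
    print(cif_downsample(feats, weights))
```

In this sketch the output length depends on the sum of the weights rather than on a fixed stride, which is what makes the downsampling dynamic: three input frames with a total weight of 2.0 yield two output vectors that could then be passed to a projector.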