Extending the input modality of Large Language Models~(LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, acoustic information is intrinsically \textit{heterogeneous}, entangling attributes such as speech, music, and environmental context. Existing work typically relies on a single dense, parameter-shared adapter to model these diverse patterns, which induces \textit{gradient conflict} during optimization, as the parameter updates required for distinct attributes contradict one another. To address this limitation, we introduce the \textit{\textbf{MoE-Adapter}}, a sparse Mixture-of-Experts~(MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces, while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines at comparable computational cost. We will release the code and models to facilitate future research.
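To make the routing idea concrete, the following is a minimal PyTorch sketch of a sparse MoE adapter that sends each audio token to its top-$k$ specialized experts and always applies a shared expert for global context. All names, dimensions, the top-$k$ choice, and the \texttt{MoEAdapterSketch} module itself are illustrative assumptions, not the released implementation.

\begin{verbatim}
# Minimal sketch of a sparse MoE adapter with routed + shared experts.
# Dimensions, expert count, and top-k routing are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Small feed-forward expert projecting audio features to the LLM width."""
    def __init__(self, in_dim: int, out_dim: int, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


class MoEAdapterSketch(nn.Module):
    """Routes each audio token to top-k experts and adds a shared expert."""
    def __init__(self, in_dim=1280, out_dim=4096, hidden_dim=2048,
                 num_experts=4, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [Expert(in_dim, out_dim, hidden_dim) for _ in range(num_experts)]
        )
        self.shared_expert = Expert(in_dim, out_dim, hidden_dim)  # always active
        self.gate = nn.Linear(in_dim, num_experts)                # dynamic router
        self.top_k = top_k
        self.out_dim = out_dim

    def forward(self, audio_tokens: torch.Tensor) -> torch.Tensor:
        # audio_tokens: (batch, seq_len, in_dim) from a frozen audio encoder
        logits = self.gate(audio_tokens)                   # (B, T, num_experts)
        weights, indices = logits.topk(self.top_k, dim=-1) # sparse routing
        weights = F.softmax(weights, dim=-1)

        routed = torch.zeros(*audio_tokens.shape[:-1], self.out_dim,
                             device=audio_tokens.device,
                             dtype=audio_tokens.dtype)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = indices[..., slot] == e             # tokens sent to expert e
                if mask.any():
                    routed[mask] += (weights[..., slot][mask].unsqueeze(-1)
                                     * expert(audio_tokens[mask]))

        # Shared expert preserves global acoustic context for every token.
        return routed + self.shared_expert(audio_tokens)


if __name__ == "__main__":
    adapter = MoEAdapterSketch()
    dummy = torch.randn(2, 50, 1280)   # e.g., 2 clips, 50 audio frames
    print(adapter(dummy).shape)        # torch.Size([2, 50, 4096])
\end{verbatim}

In this sketch, only the top-$k$ routed experts run per token, so distinct acoustic attributes can update disjoint expert parameters, while the always-on shared expert carries context common to all tokens.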