Multimodal emotion understanding requires effective integration of text, audio, and visual modalities for both discrete emotion recognition and continuous sentiment analysis. We present EGMF, a unified framework that combines expert-guided multimodal fusion with large language models. Our approach features three specialized expert networks: a fine-grained local expert that captures subtle emotional nuances, a semantic correlation expert that models cross-modal relationships, and a global context expert that handles long-range dependencies. These experts are adaptively integrated through a hierarchical dynamic gating mechanism for context-aware feature selection. The enhanced multimodal representations are integrated with LLMs via pseudo token injection and prompt-based conditioning, enabling a single generative framework to handle both classification and regression through natural language generation. We employ LoRA fine-tuning for computational efficiency. Experiments on bilingual benchmarks (MELD, CHERMA, MOSEI, SIMS-V2) demonstrate consistent improvements over state-of-the-art methods, and the superior cross-lingual robustness reveals universal patterns of multimodal emotional expression across English and Chinese. We will release the source code publicly.
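As a rough illustration of the fusion-and-injection pipeline summarized above, the sketch below mixes three placeholder expert networks with a softmax gate and projects the result into pseudo tokens for an LLM. This is not the released implementation: the class and parameter names (`fused_dim`, `llm_dim`, `num_pseudo_tokens`), the MLP expert bodies, and the single-level gate are illustrative assumptions, and the hierarchical gating details, prompt conditioning, and LoRA fine-tuning are omitted.

```python
# Minimal sketch (assumed, not the authors' code) of expert-guided fusion
# with pseudo-token injection, using PyTorch and illustrative dimensions.
import torch
import torch.nn as nn

class ExpertGuidedFusion(nn.Module):
    """Mixes three expert transforms of a fused multimodal feature via a
    dynamic softmax gate, then projects the result into the LLM embedding
    space as a short sequence of pseudo tokens."""
    def __init__(self, fused_dim=768, llm_dim=4096, num_pseudo_tokens=8):
        super().__init__()
        # Placeholder MLP experts; in the paper they specialize in local
        # detail, cross-modal correlation, and global context.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(fused_dim, fused_dim), nn.GELU(),
                           nn.Linear(fused_dim, fused_dim)) for _ in range(3)]
        )
        # Dynamic gate: context-dependent weights over the three experts.
        self.gate = nn.Sequential(nn.Linear(fused_dim, 3), nn.Softmax(dim=-1))
        # Projection into pseudo tokens consumed by the LoRA-tuned LLM.
        self.to_pseudo = nn.Linear(fused_dim, llm_dim * num_pseudo_tokens)
        self.num_pseudo_tokens = num_pseudo_tokens
        self.llm_dim = llm_dim

    def forward(self, fused):  # fused: (batch, fused_dim)
        expert_outs = torch.stack([e(fused) for e in self.experts], dim=1)  # (B, 3, D)
        weights = self.gate(fused).unsqueeze(-1)                            # (B, 3, 1)
        mixed = (weights * expert_outs).sum(dim=1)                          # (B, D)
        pseudo = self.to_pseudo(mixed).view(-1, self.num_pseudo_tokens, self.llm_dim)
        return pseudo  # prepended to the embedded prompt tokens of the LLM

# Usage: the pseudo tokens would be concatenated with the prompt embeddings
# before the (frozen, LoRA-adapted) LLM generates the label or score as text.
fusion = ExpertGuidedFusion()
pseudo_tokens = fusion(torch.randn(2, 768))  # shape: (2, 8, 4096)
```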