The evolution of Omni-Modal Large Language Models~(Omni-LLMs) has revolutionized human--computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often exhibiting superficial understanding and contextually mismatched emotional responses. This issue is further intensified by the Thinker-Talker architecture of Omni-LLMs, in which the two modules are connected only implicitly through hidden states, leading to the loss of emotional details. In this work, we present EmoOmni, a unified framework for accurate understanding and expression in multimodal emotional dialogue. At its core, we introduce the emotional Chain-of-Thought~(E-CoT), which enforces a reasoning process from fine-grained multimodal perception to textual response. Moreover, we explicitly treat the E-CoT as high-level emotional instructions that guide the talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to obtain annotated real-world dialogue data and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of the multimodal emotional dialogue task. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3Omni-30B-A3B-Thinking under the same talker.