Effective human-robot interaction requires emotionally rich multimodal expression, yet most humanoid robots lack coordinated speech, facial expressions, and gestures. Meanwhile, real-world deployment demands on-device solutions that can operate autonomously without continuous cloud connectivity. To bridge \underline{\textit{S}}peech, \underline{\textit{E}}motion, and \underline{\textit{M}}otion, we present \textit{SeM$^2$}, a Vision Language Model-based framework that orchestrates emotionally coherent multimodal interactions through three key components: a multimodal perception module that captures user contextual cues, a Chain-of-Thought reasoning module for response planning, and a novel Semantic-Sequence Aligning Mechanism (SSAM) that ensures precise temporal coordination between verbal content and physical expressions. We implement both a cloud-based version and an \underline{\textit{e}}dge-deployed version (\textit{SeM$^2_e$}), the latter distilled via knowledge distillation to operate efficiently on edge hardware while retaining 95\% of the cloud version's performance. Comprehensive evaluations demonstrate that our approach significantly outperforms unimodal baselines in naturalness, emotional clarity, and modal coherence, advancing socially expressive humanoid robotics for diverse real-world environments.