Embodied Conversational Agents (ECAs) aim to emulate human face-to-face interaction through speech, gestures, and facial expressions. Current large language model (LLM)-based conversational agents lack embodiment and the expressive gestures essential for natural interaction. Existing ECA solutions often produce rigid, low-diversity motions that are unsuitable for human-like interaction. Generative methods for co-speech gesture synthesis, on the other hand, yield natural body gestures but depend on future speech context and incur long runtimes. To bridge this gap, we present MIBURI, the first online, causal framework for generating expressive full-body gestures and facial expressions synchronized with real-time spoken dialogue. We employ body-part-aware gesture codecs that encode hierarchical motion details into multi-level discrete tokens. These tokens are then generated autoregressively by a two-dimensional causal framework conditioned on LLM-based speech-text embeddings, modeling both temporal dynamics and the part-level motion hierarchy in real time. Furthermore, we introduce auxiliary objectives that encourage expressive, diverse gestures while preventing convergence to static poses. Comparative evaluations demonstrate that our causal, real-time approach produces more natural and contextually aligned gestures than recent baselines. We encourage the reader to explore the demo videos at https://vcai.mpi-inf.mpg.de/projects/MIBURI/.
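To make the two-dimensional causal generation concrete, the sketch below illustrates one plausible reading of the abstract: discrete gesture tokens laid out on a (time, body-part) grid are decoded autoregressively under a causal mask, conditioned on per-frame speech-text embeddings. This is a minimal illustration only, not the authors' implementation; all names (`GestureDecoder`, `num_parts`, `codebook_size`) and the flattened row-major ordering over time and parts are assumptions for the sake of the example.

```python
# Hypothetical sketch of 2D (time x body-part) causal autoregressive decoding
# of discrete gesture tokens, conditioned on speech-text embeddings.
import torch
import torch.nn as nn

class GestureDecoder(nn.Module):
    def __init__(self, codebook_size=512, num_parts=4, d_model=256, n_layers=4):
        super().__init__()
        self.num_parts = num_parts
        self.token_emb = nn.Embedding(codebook_size + 1, d_model)  # +1 for a BOS token
        self.part_emb = nn.Embedding(num_parts, d_model)           # part-level hierarchy
        self.blocks = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=n_layers,
        )
        self.head = nn.Linear(d_model, codebook_size)

    def forward(self, tokens, speech_emb):
        # tokens: (B, T, P) previously generated gesture tokens
        # speech_emb: (B, T, d_model) per-frame LLM-based speech-text embeddings
        B, T, P = tokens.shape
        x = self.token_emb(tokens)                                  # (B, T, P, d)
        x = x + self.part_emb(torch.arange(P, device=tokens.device))
        x = x.reshape(B, T * P, -1)                                 # flatten (time, part)
        x = x + speech_emb.repeat_interleave(P, dim=1)              # share audio across parts
        # Causal mask over the flattened grid: each position attends only to
        # earlier frames and to earlier body parts within the same frame.
        mask = nn.Transformer.generate_square_subsequent_mask(T * P).to(x.device)
        h = self.blocks(x, mask=mask)
        return self.head(h).reshape(B, T, P, -1)                    # next-token logits
```

Because the mask is strictly causal over the flattened sequence, each token depends only on past dialogue and motion context, which is what allows frame-by-frame generation during a live conversation without access to future speech.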