Human-robot interaction is increasingly moving toward multi-robot, socially grounded environments. Existing systems struggle to integrate multimodal perception, embodied expression, and coordinated decision-making within a unified framework, which limits natural and scalable interaction in shared physical spaces. We address this gap with a multimodal framework for human-multi-agent interaction in which each robot operates as an autonomous cognitive agent, combining integrated multimodal perception with embodiment-grounded planning driven by a Large Language Model (LLM). At the team level, a centralized coordination mechanism regulates turn-taking and agent participation to prevent overlapping speech and conflicting actions. Implemented on two humanoid robots, our framework enables coherent multi-agent interaction through policies that combine speech, gesture, gaze, and locomotion. Representative interaction runs demonstrate coordinated multimodal reasoning across agents and grounded embodied responses. Future work will focus on larger-scale user studies and deeper exploration of socially grounded multi-agent interaction dynamics.
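To make the team-level coordination concrete, below is a minimal Python sketch of a centralized turn-taking coordinator that grants a conversational "floor" to at most one agent at a time, which is one way to prevent overlapping speech and conflicting actions. The class and method names (`TurnCoordinator`, `request_turn`, `release_turn`) and the FIFO queueing policy are illustrative assumptions, not details taken from the framework itself.

```python
# Hypothetical sketch of a centralized turn-taking coordinator. One agent
# at a time holds the "floor"; others queue until it is released.
from dataclasses import dataclass, field
from collections import deque
from typing import Deque, Optional

@dataclass
class TurnCoordinator:
    """Grants the conversational floor to at most one agent at a time."""
    waiting: Deque[str] = field(default_factory=deque)  # FIFO queue of agent IDs
    holder: Optional[str] = None                        # agent currently holding the floor

    def request_turn(self, agent_id: str) -> bool:
        """Agent asks to speak/act; returns True if the floor is granted now."""
        if self.holder is None:
            self.holder = agent_id
            return True
        if agent_id not in self.waiting:
            self.waiting.append(agent_id)
        return False

    def release_turn(self, agent_id: str) -> Optional[str]:
        """Holder yields the floor; the next queued agent (if any) receives it."""
        if self.holder != agent_id:
            raise ValueError(f"{agent_id} does not hold the floor")
        self.holder = self.waiting.popleft() if self.waiting else None
        return self.holder

# Usage: robot_a speaks first; robot_b queues and receives the floor
# only after robot_a releases it, so speech never overlaps.
coord = TurnCoordinator()
assert coord.request_turn("robot_a")         # granted immediately
assert not coord.request_turn("robot_b")     # floor busy, queued
assert coord.release_turn("robot_a") == "robot_b"
```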