Coordinating multiple embodied agents in dynamic environments remains a core challenge in artificial intelligence, requiring both perception-driven reasoning and scalable cooperation strategies. While recent works have leveraged large language models (LLMs) for multi-agent planning, only a few have begun to explore vision-language models (VLMs) for visual reasoning, and these VLM-based approaches remain limited in their support for diverse embodiment types. In this work, we introduce VIKI-Bench, the first hierarchical benchmark tailored for embodied multi-agent cooperation, featuring three structured levels: agent activation, task planning, and trajectory perception. VIKI-Bench includes diverse robot embodiments, multi-view visual observations, and structured supervision signals to evaluate reasoning grounded in visual inputs. To demonstrate the utility of VIKI-Bench, we propose VIKI-R, a two-stage framework that first fine-tunes a pretrained VLM on Chain-of-Thought annotated demonstrations and then applies reinforcement learning under multi-level reward signals. Our extensive experiments show that VIKI-R significantly outperforms baseline methods across all task levels. Furthermore, we show that reinforcement learning enables the emergence of compositional cooperation patterns among heterogeneous agents. Together, VIKI-Bench and VIKI-R offer a unified testbed and method for advancing multi-agent, visually driven cooperation in embodied AI systems.