Robots are increasingly expected to execute open-ended natural language requests in human environments, which demands reliable long-horizon execution under partial observability. This is especially challenging for humanoids, whose locomotion and manipulation are tightly coupled through stance, reachability, and balance. We present a humanoid agent framework that turns VLM plans into verifiable task programs and closes the loop with multi-object 3D geometric supervision. A VLM planner compiles each instruction into a typed JSON sequence of subtasks with explicit predicate-based preconditions and success conditions. Using SAM3 and RGB-D sensing, we ground all task-relevant entities in 3D, estimate object centroids and extents, and evaluate predicates over temporally stable frames to obtain condition-level diagnostics. A supervisor uses these diagnostics to verify subtask completion and to provide condition-level feedback for progression and replanning. Each subtask is executed by coordinating humanoid locomotion with whole-body manipulation, selecting feasible motion primitives under reachability and balance constraints. Experiments on tabletop manipulation and long-horizon humanoid loco-manipulation tasks show improved robustness from multi-object grounding, temporal stability, and recovery-driven replanning.
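To make the "typed JSON sequence of subtasks" concrete, the sketch below shows what one compiled subtask could look like. This is a minimal illustration under assumed field names (`skill`, `preconditions`, `success_conditions`, and the predicate names), not the paper's actual schema.

```python
# Hypothetical example of one subtask in a compiled task program.
# Field and predicate names are illustrative assumptions.
import json

subtask = {
    "id": "place_cup_on_tray",
    "skill": "pick_and_place",          # motion-primitive family to invoke
    "args": {"object": "cup", "target": "tray"},
    "preconditions": [                  # predicates checked before execution
        {"pred": "reachable", "args": ["cup"]},
        {"pred": "on", "args": ["cup", "table"]},
    ],
    "success_conditions": [             # predicates the supervisor verifies
        {"pred": "on", "args": ["cup", "tray"]},
        {"pred": "gripper_empty", "args": []},
    ],
}

print(json.dumps(subtask, indent=2))
```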
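The geometric supervision step can likewise be sketched from what the abstract states: predicates are evaluated from 3D centroids and extents, and success is reported only when a condition holds over temporally stable frames. The test below (`on_top`, `stable`, axis-aligned `Box`, and the tolerance values) is an assumed instantiation, not the paper's exact geometric check.

```python
# Hedged sketch: evaluating an on(a, b) predicate from per-frame 3D
# centroids/extents, with a temporal-stability filter over k frames.
from dataclasses import dataclass

import numpy as np

@dataclass
class Box:
    centroid: np.ndarray  # (3,) world-frame centroid from RGB-D + masks
    extent: np.ndarray    # (3,) axis-aligned half-extents

def on_top(a: Box, b: Box, xy_tol: float = 0.02, z_tol: float = 0.03) -> bool:
    """a rests on b: a's xy centroid lies over b's footprint and
    a's bottom face is within z_tol of b's top face."""
    xy_inside = np.all(np.abs(a.centroid[:2] - b.centroid[:2])
                       <= b.extent[:2] + xy_tol)
    gap = (a.centroid[2] - a.extent[2]) - (b.centroid[2] + b.extent[2])
    return bool(xy_inside and abs(gap) <= z_tol)

def stable(history: list[bool], k: int = 5) -> bool:
    """Condition-level diagnostic: True only if the predicate held
    on each of the last k frames."""
    return len(history) >= k and all(history[-k:])

# Per-frame loop (sensing stubs omitted): append on_top(cup, tray) to a
# history buffer; the supervisor marks the success condition satisfied
# only once stable(history) is True, and otherwise emits the failing
# condition as feedback for replanning.
```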