Recent advances in large vision-language models (VLMs) have demonstrated generalizable open-vocabulary perception and reasoning, yet their real-robot manipulation capability remains unclear for long-horizon, closed-loop execution in unstructured, in-the-wild environments. Prior VLM-based manipulation pipelines are difficult to compare across research groups, and many evaluations rely on simulation, privileged state, or specially designed setups. We present AgenticLab, a model-agnostic robot agent platform and benchmark for open-world manipulation. AgenticLab provides a closed-loop agent pipeline for perception, task decomposition, online verification, and replanning. Using AgenticLab, we benchmark state-of-the-art VLM-based agents on real-robot tasks in unstructured environments. Our benchmark reveals several failure modes that offline vision-language tests (e.g., VQA and static image understanding) do not capture, including breakdowns in multi-step grounding consistency, failures of object grounding under occlusion and scene changes, and spatial reasoning that is insufficient for reliable manipulation. We will release the full hardware and software stack to support reproducible evaluation and accelerate research on general-purpose robot agents.