Achieving general-purpose robotic manipulation requires robots to seamlessly bridge high-level semantic intent with low-level physical interaction in unstructured environments. However, existing approaches falter in zero-shot generalization: end-to-end Vision-Language-Action (VLA) models often lack the precision required for long-horizon tasks, while traditional hierarchical planners suffer from semantic rigidity when facing open-world variations. To address this, we present UniManip, a framework grounded in a Bi-level Agentic Operational Graph (AOG) that unifies semantic reasoning and physical grounding. By coupling a high-level Agentic Layer for task orchestration with a low-level Scene Layer for dynamic state representation, the system continuously aligns abstract planning with geometric constraints, enabling robust zero-shot execution. Unlike static pipelines, UniManip operates as a dynamic agentic loop: it actively instantiates object-centric scene graphs from unstructured perception, parameterizes these representations into collision-free trajectories via a safety-aware local planner, and exploits structured memory to autonomously diagnose and recover from execution failures. Extensive experiments validate the system's robust zero-shot capability on unseen objects and tasks, demonstrating a 22.5% and 25.0% higher success rate compared to state-of-the-art VLA and hierarchical baselines, respectively. Notably, the system enables direct zero-shot transfer from fixed-base setups to mobile manipulation without fine-tuning or reconfiguration. Our open-source project page can be found at https://henryhcliu.github.io/unimanip.
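The abstract outlines a dynamic agentic loop: instantiate an object-centric scene graph, parameterize it into trajectories via a local planner, and use structured memory to diagnose and recover from failures. The following is a minimal illustrative sketch of that control flow only; every class and function name here (`SceneGraph`, `Memory`, `plan_trajectory`, `agentic_loop`) is a hypothetical stand-in invented for exposition, not the UniManip API, and the planner is reduced to a trivial presence check.

```python
# Hedged sketch of the agentic loop described above. All names are
# illustrative assumptions, NOT the actual UniManip implementation.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Low-level Scene Layer stand-in: object-centric state from perception."""
    objects: dict  # object name -> pose (placeholder representation)


@dataclass
class Memory:
    """Structured memory stand-in: records executed steps and failures."""
    log: list = field(default_factory=list)

    def record(self, step, ok, info=""):
        self.log.append((step, ok, info))

    def last_failure(self):
        failures = [e for e in self.log if not e[1]]
        return failures[-1] if failures else None


def plan_trajectory(scene, step):
    """Stand-in for the safety-aware local planner: returns a (dummy)
    trajectory if the target object exists in the scene, else None."""
    return [step] if step in scene.objects else None


def agentic_loop(task_steps, scene, memory, max_retries=2):
    """High-level Agentic Layer stand-in: orchestrates steps, logging
    failures to memory and retrying before declaring the task failed."""
    for step in task_steps:
        for attempt in range(max_retries + 1):
            traj = plan_trajectory(scene, step)
            if traj is not None:
                memory.record(step, True)
                break  # step succeeded; move to the next one
            memory.record(step, False, f"attempt {attempt}: no feasible plan")
        else:
            return False  # all retries exhausted: unrecoverable failure
    return True
```

The sketch only conveys the loop structure (perceive, plan, record, retry); the real system would replace `plan_trajectory` with collision-aware motion planning and `Memory` with richer diagnostic reasoning.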