Autonomous robotic systems capable of learning novel manipulation tasks are poised to transform industries from manufacturing to service automation. However, modern methods (e.g., VIP and R3M) still face significant hurdles, notably the domain gap among robotic embodiments and the sparsity of successful task executions within specific action spaces, resulting in misaligned and ambiguous task representations. We introduce Ag2Manip (Agent-Agnostic representations for Manipulation), a framework aimed at surmounting these challenges through two key innovations: a novel agent-agnostic visual representation derived from human manipulation videos, with the specifics of embodiments obscured to enhance generalizability; and an agent-agnostic action representation abstracting a robot's kinematics to a universal agent proxy, emphasizing crucial interactions between end-effector and object. Ag2Manip's empirical validation across simulated benchmarks like FrankaKitchen, ManiSkill, and PartManip shows a 325% increase in performance, achieved without domain-specific demonstrations. Ablation studies underline the essential contributions of the visual and action representations to this success. Extending our evaluations to the real world, Ag2Manip significantly improves imitation learning success rates from 50% to 77.5%, demonstrating its effectiveness and generalizability across both simulated and physical environments.