This work proposes RAM, a retrieve-and-transfer framework for zero-shot robotic manipulation that generalizes across objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM uses a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordances at scale from diverse sources of demonstrations, including robotic data, human-object interaction (HOI) data, and custom data, to construct a comprehensive affordance memory. Then, given a language instruction, RAM hierarchically retrieves the most similar demonstration from the affordance memory and transfers its out-of-domain 2D affordance into an in-domain, executable 3D affordance in a zero-shot and embodiment-agnostic manner. Extensive simulation and real-world evaluations demonstrate that RAM consistently outperforms existing methods on diverse daily tasks. Additionally, RAM shows significant potential for downstream applications such as automatic and efficient data collection, one-shot visual imitation, and LLM/VLM-integrated long-horizon manipulation. For more details, please see the project website at https://yxkryptonite.github.io/RAM/.
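The retrieve-and-transfer idea can be illustrated with a minimal sketch. It is not the RAM implementation: the `Demo` record, the keyword-based task filter, the cosine-similarity ranking, and the pinhole back-projection are all assumptions chosen for illustration. The 2D-to-2D correspondence step that maps a demonstration's affordance into the target image is omitted, so the sketch treats the retrieved contact pixel as if it had already been mapped into the target view.

```python
import math
from dataclasses import dataclass


@dataclass
class Demo:
    """Hypothetical affordance-memory entry (not RAM's actual schema)."""
    task: str            # short task label, e.g. "open drawer"
    feature: tuple       # assumed visual embedding of the demonstration
    contact_px: tuple    # (u, v) 2D contact point in pixels


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(memory, instruction, query_feature):
    """Two-stage retrieval: filter by task, then rank by visual similarity."""
    # Stage 1 (assumed): keep demos whose task words all occur in the instruction.
    words = set(instruction.lower().split())
    candidates = [d for d in memory if set(d.task.split()) <= words]
    if not candidates:
        return None
    # Stage 2 (assumed): pick the candidate closest to the query embedding.
    return max(candidates, key=lambda d: cosine(d.feature, query_feature))


def backproject(pixel, depth_m, fx, fy, cx, cy):
    """Lift a pixel to a 3D camera-frame point with a pinhole camera model."""
    u, v = pixel
    return ((u - cx) * depth_m / fx, (v - cy) * depth_m / fy, depth_m)


# Toy affordance memory with made-up values.
memory = [
    Demo("open drawer", (1.0, 0.0), (320, 240)),
    Demo("open drawer", (0.0, 1.0), (100, 50)),
    Demo("close laptop", (1.0, 0.0), (10, 10)),
]

best = retrieve(memory, "Open the drawer", (0.9, 0.1))
# Intrinsics and depth are illustrative values, not calibrated ones.
point = backproject(best.contact_px, 0.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

The two stages mirror the abstract's description only loosely: a coarse filter driven by the language instruction narrows the memory, and a finer similarity ranking selects one demonstration. The back-projection shows one common way a 2D point plus a depth reading becomes a 3D target that any embodiment's motion planner could consume.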