Developing agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments is a long-standing problem in robotics. To achieve this goal, the robot needs a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel representation. We evaluate GNFactor on 3 real-robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. GNFactor substantially outperforms current state-of-the-art methods on both seen and unseen tasks, demonstrating its strong generalization ability. Our project website is https://yanjieze.com/GNFactor/.