Developing agents that can execute diverse manipulation tasks from visual observations in unstructured real-world environments is a long-standing problem in robotics. To achieve this goal, the robot needs a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, both built on a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module uses a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel representation. We evaluate GNFactor on 3 real-robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. GNFactor improves substantially over current state-of-the-art methods on both seen and unseen tasks, demonstrating its strong generalization ability. Project website: https://yanjieze.com/GNFactor/
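One way to read the joint optimization of the reconstruction and decision-making modules is as a single weighted objective over the shared voxel representation. The following is a sketch under that assumption, not a statement of the actual training loss. The symbols $o$ (visual observation), $f$ (shared voxel encoder), $v$ (deep 3D voxel representation), $\pi$ (Perceiver Transformer policy), $g$ (GNF reconstruction module), and the weight $\lambda$ are introduced here for illustration only.

$$
v = f(o), \qquad \mathcal{L} = \mathcal{L}_{\text{action}}\big(\pi(v)\big) + \lambda \, \mathcal{L}_{\text{recon}}\big(g(v)\big)
$$

Under this reading, gradients from both the behavior cloning term and the reconstruction term flow into $f$, so the voxel representation is shaped by the action supervision and by the distilled semantic features at the same time.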