We introduce GROOT, an imitation learning method for learning robust policies with object-centric and 3D priors. GROOT builds policies for vision-based manipulation that generalize beyond their initial training conditions. It constructs object-centric 3D representations that are robust to background changes and camera viewpoint shifts, and reasons over these representations using a transformer-based policy. Furthermore, we introduce a segmentation correspondence model that allows policies to generalize to new objects at test time. Through comprehensive experiments, we validate the robustness of GROOT policies against perceptual variations in simulated and real-world environments. GROOT generalizes well over background changes, camera viewpoint shifts, and the presence of new object instances, whereas both state-of-the-art end-to-end learning methods and object proposal-based approaches fall short. We also extensively evaluate GROOT policies on real robots, where we demonstrate their efficacy under drastic changes in setup. More videos and model details can be found in the appendix and on the project website: https://ut-austin-rpl.github.io/GROOT .
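To make the idea of an object-centric 3D representation more concrete, the following is a minimal sketch and not GROOT's actual implementation. It assumes an RGB-D camera with a pinhole model, a per-object segmentation mask, and a known camera-to-world transform; the function name `backproject_object` and all of its parameters are hypothetical. Keeping only masked pixels discards the background, and expressing the points in a shared world frame removes the dependence on where the camera was placed, which is one plausible reading of why such a representation could tolerate background and viewpoint changes.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def backproject_object(
    depth: Sequence[Sequence[float]],         # depth image in meters, indexed [row][col]
    mask: Sequence[Sequence[bool]],           # True where the pixel belongs to the object
    fx: float, fy: float,                     # assumed pinhole focal lengths (pixels)
    cx: float, cy: float,                     # assumed principal point (pixels)
    cam_to_world: Sequence[Sequence[float]],  # assumed 4x4 homogeneous transform
) -> List[Point]:
    """Lift the masked depth pixels of one object into a shared world frame."""
    points: List[Point] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip background pixels and invalid (non-positive) depth readings.
            if not keep or z <= 0.0:
                continue
            # Pinhole back-projection into the camera frame.
            p_cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Apply the top three rows of the homogeneous transform.
            p_world = tuple(
                sum(cam_to_world[r][c] * p_cam[c] for c in range(4))
                for r in range(3)
            )
            points.append(p_world)
    return points


# Toy usage: a 2x2 depth image with an identity camera pose.
identity = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
toy_depth = [[1.0, 2.0], [0.0, 4.0]]
toy_mask = [[True, False], [True, True]]
object_points = backproject_object(toy_depth, toy_mask, 1.0, 1.0, 0.0, 0.0, identity)
```

In this sketch, each segmented object would yield its own point set, and a policy could then consume one such set per object. How GROOT encodes and attends over these sets is described in the paper itself and is not reproduced here.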