We present a unified, compact scene representation for robotics in which each object in the scene is described by a latent code capturing its geometry and appearance. This representation can be decoded for a variety of tasks, such as novel view rendering, 3D reconstruction (e.g., recovering depth, point clouds, or voxel maps), collision checking, and stable grasp prediction. We build on recent advances in Neural Radiance Fields (NeRF) that learn category-level priors from large multiview datasets and then fine-tune on novel objects from one or a few views. We extend the NeRF model with additional grasp outputs and explore ways to leverage the representation for robotics. At test time, we build the representation from a single RGB input image that observes the scene from only one viewpoint. We find that the recovered representation supports rendering from novel views, including views of occluded object parts, as well as prediction of successful stable grasps. Grasp poses can be decoded directly from the latent representation with an implicit grasp decoder. We conduct experiments in both simulation and the real world and demonstrate robust robotic grasping using this compact representation. Website: https://nerfgrasp.github.io
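As a rough illustration of the idea that one per-object latent code can be decoded for several tasks, the following is a minimal sketch and not the authors' architecture. Everything in it is an assumption made for illustration: the names `decode_radiance`, `decode_grasp`, and `render_ray`, the latent size, the 7-D grasp pose (position plus quaternion), and the small MLPs with random weights that stand in for trained networks. Only the ray quadrature follows the standard NeRF volume rendering formulation.

```python
import numpy as np

LATENT_DIM = 32  # hypothetical size of the per-object latent code
POSE_DIM = 7     # hypothetical grasp pose: 3-D position + unit quaternion


def make_mlp(rng, in_dim, hidden, out_dim):
    """Random two-layer MLP weights; a stand-in for trained decoder weights."""
    return {
        "w1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 0.1, (hidden, out_dim)),
        "b2": np.zeros(out_dim),
    }


def mlp(params, x):
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU hidden layer
    return h @ params["w2"] + params["b2"]


def decode_radiance(params, z, points):
    """Map (latent code, 3-D points) to density >= 0 and RGB in [0, 1]."""
    z_tiled = np.broadcast_to(z, (len(points), z.size))
    out = mlp(params, np.concatenate([z_tiled, points], axis=1))
    sigma = np.logaddexp(0.0, out[:, 0])       # softplus keeps density non-negative
    rgb = 1.0 / (1.0 + np.exp(-out[:, 1:4]))   # sigmoid keeps colours in [0, 1]
    return sigma, rgb


def decode_grasp(params, z, grasp_poses):
    """Map (latent code, candidate grasp poses) to a success score in [0, 1]."""
    z_tiled = np.broadcast_to(z, (len(grasp_poses), z.size))
    logits = mlp(params, np.concatenate([z_tiled, grasp_poses], axis=1))[:, 0]
    return 1.0 / (1.0 + np.exp(-logits))


def render_ray(params, z, origin, direction, near=0.1, far=2.0, n_samples=64):
    """NeRF-style quadrature along one ray: colour, expected depth, opacity."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    sigma, rgb = decode_radiance(params, z, points)
    delta = np.full(n_samples, (far - near) / (n_samples - 1))
    alpha = 1.0 - np.exp(-sigma * delta)                       # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))  # transmittance
    weights = trans * alpha
    return weights @ rgb, weights @ t, weights.sum()


rng = np.random.default_rng(0)
radiance_params = make_mlp(rng, LATENT_DIM + 3, 64, 4)
grasp_params = make_mlp(rng, LATENT_DIM + POSE_DIM, 64, 1)

z = rng.normal(size=LATENT_DIM)  # one object's latent code (random here)

# Same latent code, two different decoders.
color, depth, opacity = render_ray(
    radiance_params, z, np.zeros(3), np.array([0.0, 0.0, 1.0])
)
candidates = rng.normal(size=(5, POSE_DIM))
scores = decode_grasp(grasp_params, z, candidates)
best_grasp = candidates[np.argmax(scores)]
```

The point of the sketch is the shared input: both decoders condition on the same latent code `z`, so rendering a novel view and ranking candidate grasps query one compact representation rather than separate models. With random weights the outputs are meaningless; in a trained system the weights and the latent code would come from the category-level prior and the single-view fine-tuning described above.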