Deep learning methods for perception are the cornerstone of many robotic systems. Although these methods can achieve impressive performance, obtaining real-world training data is expensive and can be impractically difficult for some tasks. Sim-to-real transfer with domain randomization offers a potential workaround, but often requires extensive manual tuning and results in models that are brittle to distribution shift between sim and real. In this work, we introduce Composable Object Volume NeRF (COV-NeRF), an object-composable NeRF model that is the centerpiece of a real-to-sim pipeline for synthesizing training data targeted to scenes and objects from the real world. COV-NeRF extracts objects from real images and composes them into new scenes, generating photorealistic renderings and many types of 2D and 3D supervision, including depth maps, segmentation masks, and meshes. We show that COV-NeRF matches the rendering quality of modern NeRF methods, and can be used to rapidly close the sim-to-real gap across a variety of perceptual modalities.