GINA-3D: Learning to Generate Implicit Neural Assets in the Wild

Modeling the 3D world from sensor data for simulation is a scalable way to develop testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating realistic environments is difficult, expensive, and not scalable. Recent generative modeling techniques have shown promising progress in addressing these challenges by learning 3D assets from plentiful 2D images alone, but they remain limited because they rely on either human-curated image datasets or renderings from manually created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared with existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations, and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 520K images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
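
The abstract does not say how the tri-plane latents are queried. As a rough illustration of what a tri-plane structure generally means, the sketch below stores three axis-aligned feature planes (XY, XZ, YZ), projects a 3D point onto each, samples bilinearly, and sums the results. This is a generic tri-plane lookup under stated assumptions, not GINA-3D's implementation: the names `bilinear_sample` and `triplane_feature`, the [-1, 1] coordinate convention, and the choice of summation as the aggregation are all hypothetical.

```python
from typing import List, Sequence

# A plane is indexed as plane[row][col] and holds a feature vector per texel.
Plane = List[List[List[float]]]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly interpolate a feature plane at (u, v) in [-1, 1]^2.

    u selects the column and v selects the row; -1 and +1 map to the
    first and last texel along each axis (assumed convention).
    """
    rows, cols = len(plane), len(plane[0])
    # Map normalized coordinates to continuous texel indices, clamped to the grid.
    x = min(max((u + 1.0) * 0.5 * (cols - 1), 0.0), cols - 1.0)
    y = min(max((v + 1.0) * 0.5 * (rows - 1), 0.0), rows - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    tx, ty = x - x0, y - y0
    channels = len(plane[0][0])
    return [
        (1 - tx) * (1 - ty) * plane[y0][x0][c]
        + tx * (1 - ty) * plane[y0][x1][c]
        + (1 - tx) * ty * plane[y1][x0][c]
        + tx * ty * plane[y1][x1][c]
        for c in range(channels)
    ]


def triplane_feature(
    plane_xy: Plane, plane_xz: Plane, plane_yz: Plane, point: Sequence[float]
) -> List[float]:
    """Project a 3D point onto the three planes and sum the sampled features."""
    x, y, z = point
    f_xy = bilinear_sample(plane_xy, x, y)
    f_xz = bilinear_sample(plane_xz, x, z)
    f_yz = bilinear_sample(plane_yz, y, z)
    # Summation is one common aggregation; concatenation is another option.
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]
```

In a full pipeline the aggregated feature would typically be passed to a small decoder that predicts density and color for volume rendering; that step is omitted here because the abstract gives no detail about it.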
