Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative modeling techniques have shown promising progress in addressing these challenges by learning 3D assets from plentiful 2D images alone, but they remain limited because they rely on either human-curated image datasets or renderings from manually created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations, and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.