GINA-3D: Learning to Generate Implicit Neural Assets in the Wild

Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative modeling techniques have shown promising progress in addressing these challenges by learning 3D assets from plentiful 2D images alone, but they remain limited because they rely on either human-curated image datasets or renderings from manually created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting variations, and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 1.2M images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries.
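
The abstract mentions a tri-plane latent structure but gives no implementation details. As a rough illustration of how a tri-plane representation is commonly queried, the following is a minimal NumPy sketch, not the authors' code. The function names, the plane resolution, the channel count, and the choice to sum the three per-plane features are all assumptions made for illustration.

```python
import numpy as np


def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at (u, v), both in [-1, 1].

    Assumes H >= 2 and W >= 2, and that (u, v) lies inside the plane.
    """
    _, height, width = plane.shape
    # Map normalized coordinates to continuous pixel coordinates.
    x = (u + 1.0) * 0.5 * (width - 1)
    y = (v + 1.0) * 0.5 * (height - 1)
    # Top-left corner of the enclosing cell, clamped so x0 + 1 and y0 + 1 stay valid.
    x0 = int(np.clip(np.floor(x), 0, width - 2))
    y0 = int(np.clip(np.floor(y), 0, height - 2))
    tx, ty = x - x0, y - y0
    top = (1 - tx) * plane[:, y0, x0] + tx * plane[:, y0, x0 + 1]
    bottom = (1 - tx) * plane[:, y0 + 1, x0] + tx * plane[:, y0 + 1, x0 + 1]
    return (1 - ty) * top + ty * bottom


def query_triplane(planes, point):
    """Return a feature vector for a 3D point from three axis-aligned planes.

    `planes` is a dict with hypothetical keys "xy", "xz", "yz", each a (C, H, W) array.
    The point is projected onto each plane and the sampled features are summed
    (summation is an assumed aggregation, one common choice among several).
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz


# Usage: random planes stand in for learned latents (illustrative sizes only).
rng = np.random.default_rng(0)
planes = {key: rng.standard_normal((8, 16, 16)) for key in ("xy", "xz", "yz")}
feature = query_triplane(planes, (0.1, -0.3, 0.7))
print(feature.shape)
```

In a full pipeline the returned feature vector would typically be passed to a small decoder that predicts density and color for volume rendering; that part is omitted here because the abstract does not describe it.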

