SUDS: Scalable Urban Dynamic Scenes

We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short duration (up to 10 seconds). There are two reasons for this: such methods (a) tend to scale linearly with the number of moving objects and input videos, because a separate model is built for each, and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and, most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, which is, to our knowledge, the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x faster to train.
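To make the three-branch factorization more concrete, the following is a minimal NumPy sketch of how per-sample outputs from a static branch, a dynamic branch, and a far-field branch might be composited along a single ray. The abstract does not specify the blending rule, so everything here is an assumption for illustration: the function name `composite_ray`, the density-weighted color blend, and the use of the far-field color as a background term are hypothetical and are not taken from the SUDS implementation.

```python
import numpy as np


def composite_ray(sigma_static, rgb_static, sigma_dynamic, rgb_dynamic,
                  rgb_far, deltas):
    """Composite one ray from three hypothetical branch outputs.

    sigma_static, sigma_dynamic: (N,) non-negative densities per sample.
    rgb_static, rgb_dynamic:     (N, 3) colors per sample.
    rgb_far:                     (3,) far-field color for this ray.
    deltas:                      (N,) distances between consecutive samples.
    """
    # Assumption: the two foreground branches add their densities.
    sigma = sigma_static + sigma_dynamic

    # Assumption: colors are blended in proportion to each branch's density.
    w_static = sigma_static / np.maximum(sigma, 1e-10)
    rgb = w_static[:, None] * rgb_static + (1.0 - w_static[:, None]) * rgb_dynamic

    # Standard volume rendering quadrature: per-sample opacity and the
    # transmittance accumulated before reaching each sample.
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha

    # Whatever light is not absorbed by the foreground is attributed
    # to the far-field branch.
    remaining = 1.0 - weights.sum()
    color = (weights[:, None] * rgb).sum(axis=0) + remaining * rgb_far
    return color, weights
```

With all densities set to zero, the ray returns the far-field color unchanged; with a very dense first static sample, the ray returns that sample's static color. A design like this keeps the three branches independent until the final blend, which is one plausible way a fixed number of tables could serve many objects.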
