Implicit neural representations have demonstrated promising results for 3D reconstruction across a variety of scenes. However, existing approaches either struggle to model fast-moving objects or cannot handle the large-scale camera ego-motion of urban environments, leading to low-quality synthesized views of large-scale urban scenes. In this paper, we aim to jointly solve the problems caused by large-scale scenes and fast-moving vehicles, a setting that is both more practical and more challenging. To this end, we propose a progressive scene graph network architecture that learns local scene representations for dynamic objects together with a global representation of the urban scene. The progressive learning scheme dynamically allocates a new local scene graph trained on the frames within a temporal window, with the window size determined automatically, allowing the representation to scale to arbitrarily large scenes. Moreover, we observe that the training views of dynamic objects are relatively sparse due to their rapid movement, which significantly degrades their reconstruction accuracy. We therefore employ a foundation-model-based network to encode latent object codes. Specifically, we leverage the generalization capability of the visual foundation model DINOv2 to extract appearance and shape codes, and train the network on a large-scale urban scene object dataset to strengthen its prior modeling ability for sparse-view dynamic inputs. In parallel, we introduce a frequency-modulated module that regularizes the frequency spectrum of objects, addressing the challenge of sparse image inputs from a frequency-domain perspective. Experimental results demonstrate that our method achieves state-of-the-art view synthesis accuracy, object manipulation, and scene roaming ability across diverse scenes.
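The progressive allocation of local scene graphs over temporal windows can be sketched as follows. This is a minimal illustration, not the paper's implementation: the allocation criterion used here (a camera-displacement threshold `max_extent`) is an assumption, since the abstract only states that the window size is determined automatically.

```python
# Hypothetical sketch of progressive local scene graph allocation.
# Assumption: a new local scene graph is started once the camera drifts
# beyond `max_extent` from the current window's origin frame.
from dataclasses import dataclass, field

@dataclass
class LocalSceneGraph:
    start_frame: int
    frames: list = field(default_factory=list)

def distance(a, b):
    """Euclidean distance between two camera positions."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def allocate_scene_graphs(camera_positions, max_extent=50.0):
    """Split a camera trajectory into temporal windows; each window is
    handled by its own local scene graph, so the overall representation
    can grow with the trajectory instead of one global model."""
    graphs = []
    current = None
    for i, pos in enumerate(camera_positions):
        origin = camera_positions[current.start_frame] if current else None
        if current is None or distance(pos, origin) > max_extent:
            current = LocalSceneGraph(start_frame=i)
            graphs.append(current)
        current.frames.append(i)
    return graphs
```

For a trajectory moving steadily along one axis, this yields a sequence of windows whose count grows with the distance traveled, which is the property that lets the representation scale to arbitrarily long drives.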
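The extraction of per-object appearance and shape codes from DINOv2 features could look roughly like the sketch below. The pooling strategy and the two linear projection heads (`make_head`) are hypothetical stand-ins; in the actual method these heads would be trained on the urban scene object dataset, and real DINOv2 patch tokens would replace the placeholder features.

```python
# Hypothetical sketch: project pooled DINOv2 patch features for an object
# crop into separate appearance and shape latent codes. The linear heads
# and mean pooling are illustrative assumptions, not the paper's design.
import numpy as np

rng = np.random.default_rng(0)

def make_head(in_dim, out_dim):
    # Untrained linear projection head; in practice this would be
    # optimized jointly with the scene representation.
    W = rng.normal(scale=0.02, size=(in_dim, out_dim))
    return lambda feat: feat @ W

def extract_codes(dino_patch_features, appearance_head, shape_head):
    """Average-pool per-patch DINOv2 features (shape: [num_patches, dim])
    into a global object descriptor, then project it into appearance and
    shape codes used to condition the object's local representation."""
    pooled = dino_patch_features.mean(axis=0)
    return appearance_head(pooled), shape_head(pooled)
```

The point of splitting appearance from shape is that a strong image-level prior can fill in object geometry and texture even when only a handful of views of a fast-moving vehicle are available.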
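A frequency-modulated regularizer for sparse views is often realized by masking high-frequency positional-encoding bands early in training, so the network fits coarse structure before fine detail. The linear unmasking schedule below is an assumption made for illustration; the abstract does not specify the module's exact form.

```python
# Hypothetical sketch of frequency regularization for sparse-view inputs:
# gradually unmask higher positional-encoding frequency bands over training.
# The linear schedule is an assumption, not the paper's stated design.
def frequency_mask(step, total_steps, num_bands):
    """Return per-band weights in [0, 1]: fully visible low-frequency
    bands, one partially visible band, and zeroed high-frequency bands."""
    visible = num_bands * min(step / total_steps, 1.0)
    mask = []
    for b in range(num_bands):
        if b < int(visible):
            mask.append(1.0)          # band fully unmasked
        elif b == int(visible):
            mask.append(visible - int(visible))  # band being revealed
        else:
            mask.append(0.0)          # band still suppressed
    return mask
```

Early in training all bands are suppressed except the lowest, which biases optimization toward smooth, low-frequency solutions and mitigates the overfitting artifacts that sparse image inputs typically cause.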