Implicit neural representations have demonstrated promising results in view synthesis for large and complex scenes. However, existing approaches either fail to capture fast-moving objects or require building the scene graph without camera ego-motion, which leads to low-quality synthesized views. We aim to jointly solve view synthesis for large-scale urban scenes and fast-moving vehicles, a setting that is both more practical and more challenging. To this end, we first leverage a graph structure to learn local scene representations of the dynamic objects and the background. We then design a progressive scheme that dynamically allocates a new local scene graph, trained on the frames within a temporal window, which allows the representation to scale to arbitrarily large scenes. In addition, the training views of urban scenes are relatively sparse, which significantly degrades reconstruction accuracy for dynamic objects. We therefore design a frequency auto-encoder network that encodes the latent code and regularizes the frequency range of objects, which enhances the representation of dynamic objects and mitigates the effect of sparse image inputs. We also employ LiDAR point projection to maintain geometric consistency in large-scale urban scenes. Experimental results demonstrate that our method achieves state-of-the-art performance in view synthesis accuracy, object manipulation, and scene roaming. The code will be open-sourced upon paper acceptance.
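The abstract does not say how the LiDAR point projection is used, so the following is only a generic sketch of the underlying operation, not the authors' implementation. It assumes a standard pinhole camera model with LiDAR points already transformed into the camera frame. The function name `project_lidar_to_depth` and the parameters `fx`, `fy`, `cx`, `cy` are hypothetical and chosen for illustration. The resulting sparse depth map is the kind of signal that could supervise rendered depth to encourage geometric consistency.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float]
DepthMap = List[List[Optional[float]]]


def project_lidar_to_depth(
    points_cam: Sequence[Point],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> DepthMap:
    """Project camera-frame LiDAR points into a sparse depth map.

    Assumption: a pinhole model with focal lengths (fx, fy) and principal
    point (cx, cy), with z pointing forward from the camera.
    Pixels that receive no LiDAR return stay None.
    """
    depth: DepthMap = [[None] * width for _ in range(height)]
    for x, y, z in points_cam:
        if z <= 0.0:
            # Points behind the camera cannot be seen.
            continue
        # Perspective division followed by the intrinsic mapping.
        u = int(round(fx * x / z + cx))
        v = int(round(fy * y / z + cy))
        if 0 <= u < width and 0 <= v < height:
            current = depth[v][u]
            # Keep the nearest return when several points hit one pixel.
            if current is None or z < current:
                depth[v][u] = z
    return depth


if __name__ == "__main__":
    points = [(0.0, 0.0, 5.0), (0.0, 0.0, 3.0), (1.0, 0.0, -2.0)]
    sparse = project_lidar_to_depth(points, 10.0, 10.0, 2.0, 2.0, 4, 4)
    print(sparse[2][2])
```

In a training loop, only pixels with a non-`None` value would contribute to a depth loss; how such a loss is weighted against the photometric term is a design choice the abstract leaves open.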