Implicit neural representations have demonstrated promising results in view synthesis for large and complex scenes. However, existing approaches either fail to capture fast-moving objects or must build the scene graph without camera ego-motion, leading to low-quality synthesized views. We aim to jointly solve view synthesis for large-scale urban scenes and fast-moving vehicles, a setting that is both more practical and more challenging. To this end, we first leverage a graph structure to learn local scene representations of the dynamic objects and the background. We then design a progressive scheme that dynamically allocates a new local scene graph, trained on the frames within a temporal window, which allows the representation to scale to arbitrarily large scenes. In addition, the training views of urban scenes are relatively sparse, which significantly degrades reconstruction accuracy for dynamic objects. We therefore design a frequency auto-encoder network that encodes the latent code and regularizes the frequency range of objects, enhancing the representation of dynamic objects and mitigating the effect of sparse image inputs. We also employ LiDAR point projection to maintain geometric consistency in large-scale urban scenes. Experimental results demonstrate that our method achieves state-of-the-art view synthesis accuracy, object manipulation, and scene roaming ability. The code will be open-sourced upon paper acceptance.
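As an illustration of the LiDAR point projection mentioned in the abstract, the following is a minimal sketch of projecting 3D points onto the image plane with a standard pinhole camera model to obtain a sparse depth map. It is not the authors' implementation: the function name `project_lidar_to_depth`, the intrinsics `fx`, `fy`, `cx`, `cy`, and the choice to keep the nearest point per pixel are all assumptions made for this example, and the points are assumed to be already transformed into the camera frame.

```python
import math
from typing import Dict, List, Tuple

Point3D = Tuple[float, float, float]
Pixel = Tuple[int, int]


def project_lidar_to_depth(
    points_cam: List[Point3D],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
    width: int,
    height: int,
) -> Dict[Pixel, float]:
    """Project camera-frame 3D points to a sparse depth map {(u, v): depth}.

    Hypothetical helper for illustration only. Uses the pinhole model
    u = fx * x / z + cx, v = fy * y / z + cy.
    """
    depth: Dict[Pixel, float] = {}
    for x, y, z in points_cam:
        if z <= 0.0:
            # Points at or behind the camera plane cannot be projected.
            continue
        u = math.floor(fx * x / z + cx)
        v = math.floor(fy * y / z + cy)
        if not (0 <= u < width and 0 <= v < height):
            # Discard points that fall outside the image.
            continue
        # Assumption: when several points hit the same pixel, keep the nearest.
        if (u, v) not in depth or z < depth[(u, v)]:
            depth[(u, v)] = z
    return depth


if __name__ == "__main__":
    points = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0), (0.0, 0.0, -5.0)]
    print(project_lidar_to_depth(points, 100.0, 100.0, 50.0, 40.0, 100, 80))
```

A sparse depth map of this kind could, for example, be compared against depth rendered from the learned representation at the same pixels. How the projected points enter the training objective is not specified in the abstract, so that use is only one plausible option.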