Separating dynamic objects from static environments with NeRFs has been widely studied in recent years. However, capturing large-scale scenes remains challenging because of their complex geometric structures and unconstrained dynamics. Without 3D motion cues, previous methods often require simplified setups with slow camera motion and only one or a few dynamic actors, which leads to suboptimal results in most urban scenes. To overcome these limitations, we present RoDUS, a pipeline for decomposing static and dynamic elements in urban scenes, with separate NeRF models for moving and non-moving components. Our approach combines a robust kernel-based initialization with 4D semantic information to selectively guide the learning process. This strategy captures the dynamics in the scene accurately and reduces NeRF artifacts in the background reconstruction, using only self-supervision. Experimental evaluations on the KITTI-360 and Pandaset datasets demonstrate the effectiveness of our method in decomposing challenging urban scenes into precise static and dynamic components.
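The abstract does not say how the static and dynamic NeRF models are combined at render time. As a rough illustration only, the sketch below shows one blending scheme that is common in two-field dynamic NeRF work: densities are summed, colors are mixed in proportion to density, and the result is composited with standard volume rendering. The function name `composite_two_fields`, its arguments, and the "dynamic share" output are hypothetical and are not taken from RoDUS.

```python
import numpy as np

def composite_two_fields(sigma_s, rgb_s, sigma_d, rgb_d, deltas):
    """Blend a static and a dynamic field along one ray (illustrative only).

    sigma_s, sigma_d: (N,) densities of the static / dynamic field per sample
    rgb_s, rgb_d:     (N, 3) colors of the static / dynamic field per sample
    deltas:           (N,) distances between consecutive samples
    """
    eps = 1e-10
    # Assumption: the combined density is the sum of both fields.
    sigma = sigma_s + sigma_d
    # Density-weighted color mix at each sample.
    mix = (sigma_s[:, None] * rgb_s + sigma_d[:, None] * rgb_d) / np.maximum(sigma, eps)[:, None]
    # Standard volume rendering: per-sample opacity and accumulated transmittance.
    alpha = 1.0 - np.exp(-sigma * deltas)
    trans = np.concatenate([[1.0], np.cumprod(1.0 - alpha)[:-1]])
    weights = trans * alpha
    color = (weights[:, None] * mix).sum(axis=0)
    # Fraction of the rendered pixel explained by the dynamic field.
    dyn_share = (weights * sigma_d / np.maximum(sigma, eps)).sum()
    return color, weights, dyn_share
```

Under this scheme, a per-pixel value such as `dyn_share` could serve as a soft dynamic mask, and rendering with `sigma_d` set to zero would give a static-only background. Whether RoDUS uses this formulation is not stated in the abstract.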