We present EmerNeRF, a simple yet powerful approach for learning spatial-temporal representations of dynamic driving scenes. Grounded in neural fields, EmerNeRF simultaneously captures scene geometry, appearance, motion, and semantics via self-bootstrapping. EmerNeRF hinges upon two core components. First, it stratifies scenes into static and dynamic fields. This decomposition emerges purely from self-supervision, enabling our model to learn from general, in-the-wild data sources. Second, EmerNeRF parameterizes an induced flow field from the dynamic field and uses this flow field to further aggregate multi-frame features, amplifying the rendering precision of dynamic objects. Coupling these three fields (static, dynamic, and flow) enables EmerNeRF to represent highly dynamic scenes self-sufficiently, without relying on ground-truth object annotations or pre-trained models for dynamic object segmentation or optical flow estimation. Our method achieves state-of-the-art performance in sensor simulation, significantly outperforming previous methods when reconstructing static (+2.93 PSNR) and dynamic (+3.70 PSNR) scenes. In addition, to bolster EmerNeRF's semantic generalization, we lift 2D visual foundation model features into 4D space-time and address a general positional bias in modern Transformers, significantly boosting 3D perception performance (e.g., 37.50% relative improvement in occupancy prediction accuracy on average). Finally, we construct a diverse and challenging 120-sequence dataset to benchmark neural fields under extreme and highly dynamic settings.
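To make the static/dynamic decomposition concrete, the following is a minimal sketch of how two fields might be combined along a single camera ray. The abstract does not specify the combination rule, so the density-weighted mixing used here is an assumption, and the function name `composite_ray` and its parameters are hypothetical names chosen for illustration rather than the authors' implementation.

```python
import math


def composite_ray(sigma_s, sigma_d, rgb_s, rgb_d, deltas):
    """Volume-render one ray through a static field plus a dynamic field.

    ASSUMPTION: the two fields are mixed by their relative densities at each
    sample; this rule is illustrative and is not stated in the abstract.

    sigma_s, sigma_d: per-sample densities of the static and dynamic fields.
    rgb_s, rgb_d: per-sample colors, each an (r, g, b) tuple.
    deltas: distance covered by each sample along the ray.
    Returns the pixel color and the share of opacity due to the dynamic field.
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    dynamic_weight = 0.0
    for ss, sd, cs, cd, dt in zip(sigma_s, sigma_d, rgb_s, rgb_d, deltas):
        sigma = ss + sd  # total density at this sample
        if sigma <= 0.0:
            continue  # empty space contributes nothing
        alpha = 1.0 - math.exp(-sigma * dt)  # opacity of this sample
        weight = transmittance * alpha
        frac_d = sd / sigma  # dynamic share of the density
        for k in range(3):
            pixel[k] += weight * ((1.0 - frac_d) * cs[k] + frac_d * cd[k])
        dynamic_weight += weight * frac_d
        transmittance *= 1.0 - alpha
    return tuple(pixel), dynamic_weight
```

Under this assumed rule, the accumulated `dynamic_weight` behaves like a soft per-pixel dynamic mask that needs no segmentation labels: a ray that hits only static density yields 0, and a ray fully blocked by dynamic density yields 1.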