Novel view synthesis (NVS) of static and dynamic urban scenes is essential for autonomous driving simulation, yet existing methods often struggle to balance reconstruction time with quality. While state-of-the-art neural radiance fields and 3D Gaussian Splatting approaches achieve photorealism, they typically rely on time-consuming per-scene optimization. Conversely, emerging feed-forward methods frequently adopt per-pixel Gaussian representations, which lead to 3D inconsistencies when aggregating multi-view predictions in complex, dynamic environments. We propose EvolSplat4D, a feed-forward framework that moves beyond existing per-pixel paradigms by unifying volume-based and pixel-based Gaussian prediction across three specialized branches. For close-range static regions, we predict consistent geometry of 3D Gaussians over multiple frames directly from a 3D feature volume, complemented by a semantically enhanced image-based rendering module that predicts their appearance. For dynamic actors, we utilize object-centric canonical spaces and a motion-adjusted rendering module to aggregate temporal features, ensuring stable 4D reconstruction despite noisy motion priors. Far-field scenery is handled by an efficient per-pixel Gaussian branch to ensure full-scene coverage. Experimental results on the KITTI-360, KITTI, Waymo, and PandaSet datasets show that EvolSplat4D reconstructs both static and dynamic environments with superior accuracy and consistency, outperforming both per-scene optimization methods and state-of-the-art feed-forward baselines.