This paper aims to tackle the problem of photorealistic view synthesis from vehicle sensor data. Recent advancements in neural scene representation have achieved notable success in rendering high-quality autonomous driving scenes, but performance degrades significantly as the viewpoint deviates from the training trajectory. To mitigate this problem, we introduce StreetCrafter, a novel controllable video diffusion model that utilizes LiDAR point cloud renderings as pixel-level conditions, fully exploiting the generative prior for novel view synthesis while preserving precise camera control. Moreover, the pixel-level LiDAR conditions allow us to make accurate pixel-level edits to target scenes. In addition, the generative prior of StreetCrafter can be effectively incorporated into dynamic scene representations to achieve real-time rendering. Experiments on the Waymo Open Dataset and PandaSet demonstrate that our model enables flexible control over viewpoint changes and enlarges the view synthesis region to produce satisfactory renderings, outperforming existing methods.
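To make the notion of "LiDAR point cloud renderings as pixel-level conditions" concrete, the sketch below projects LiDAR points into a target camera view to form a pixel-aligned sparse depth map. This is a minimal illustrative helper under standard pinhole-camera assumptions; the function name, sparse-depth choice, and z-buffer-free splatting are ours, not StreetCrafter's actual conditioning pipeline.

```python
import numpy as np

def project_lidar_to_image(points_world, T_world_to_cam, K, hw):
    """Render LiDAR points into a camera as a sparse depth image,
    i.e. one possible pixel-aligned condition for a diffusion model.
    Hypothetical sketch, not the paper's implementation."""
    h, w = hw
    # Lift points to homogeneous coordinates and move to the camera frame.
    pts_h = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    pts_cam = (T_world_to_cam @ pts_h.T).T[:, :3]
    # Keep only points in front of the camera.
    front = pts_cam[:, 2] > 1e-3
    pts_cam = pts_cam[front]
    # Perspective projection with intrinsics K.
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / uv[:, 2:3]
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    inb = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    # Splat metric depth into the image grid (unfilled pixels stay 0).
    cond = np.zeros((h, w), dtype=np.float32)
    cond[v[inb], u[inb]] = pts_cam[inb, 2]
    return cond
```

Because the condition lives on the target view's pixel grid, moving the virtual camera away from the recorded trajectory only changes `T_world_to_cam`, which is what enables precise camera control; per-pixel edits to the point cloud likewise translate directly into per-pixel edits of the condition image.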