This paper proposes AutoScape, a long-horizon driving scene generation framework. At its core is a novel RGB-D diffusion model that iteratively generates sparse, geometrically consistent keyframes, which serve as reliable anchors for the scene's appearance and geometry. To maintain long-range geometric consistency, the model 1) jointly handles images and depth in a shared latent space, 2) explicitly conditions on the existing scene geometry (i.e., rendered point clouds) from previously generated keyframes, and 3) steers the sampling process with warp-consistent guidance. Given high-quality RGB-D keyframes, a video diffusion model then interpolates between them to produce dense, coherent video frames. AutoScape generates realistic and geometrically consistent driving videos of over 20 seconds, improving the long-horizon FID and FVD scores over the prior state-of-the-art by 48.6\% and 43.0\%, respectively.
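The conditioning on rendered point clouds and the warp-consistent guidance both rest on the same geometric primitive: reprojecting a previously generated RGB-D keyframe into a new camera view via its depth map. The sketch below illustrates that primitive only; it is not the paper's implementation, and the function name, arguments, and the pinhole-camera setup (intrinsics `K`, relative pose `R`, `t`) are illustrative assumptions.

```python
import numpy as np

def warp_keyframe(depth, K, R, t):
    """Reproject every pixel of a source RGB-D keyframe into a target view.

    Illustrative sketch (not AutoScape's actual code): unproject each pixel
    using its depth, apply the relative camera pose (R, t), and project back
    with intrinsics K. Returns target-view pixel coords and the new depths.
    """
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]                            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)  # back-project to 3D
    pts = R @ pts + t.reshape(3, 1)                      # apply relative pose
    proj = K @ pts                                       # project to target view
    uv = proj[:2] / proj[2:3]                            # perspective divide
    return uv.reshape(2, h, w), pts[2].reshape(h, w)     # coords + warped depth
```

A warp-consistency check between two keyframes can then compare the warped depth (and warped colors sampled at `uv`) against the values the model generated at the target view, penalizing disagreement during sampling.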