Recent advances in 3D object generation using diffusion models have achieved remarkable success, but generating realistic 3D urban scenes remains challenging. Existing methods that rely solely on 3D diffusion models tend to suffer degraded appearance details, while those using only 2D diffusion models typically compromise camera controllability. To overcome these limitations, we propose ScenDi, a method for urban scene generation that integrates both 3D and 2D diffusion models. We first train a 3D latent diffusion model to generate 3D Gaussians, enabling the rendering of images at a relatively low resolution. To enable controllable synthesis, this 3DGS generation process can optionally be conditioned on inputs such as 3D bounding boxes, road maps, or text prompts. We then train a 2D video diffusion model to enhance appearance details, conditioned on images rendered from the 3D Gaussians. By leveraging the coarse 3D scene as guidance for 2D video diffusion, ScenDi generates the desired scenes from the input conditions while adhering to accurate camera trajectories. Experiments on two challenging real-world datasets, Waymo and KITTI-360, demonstrate the effectiveness of our approach.
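The two-stage pipeline described above can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: all function names, the orthographic splatting, and the 4x upsampling stand-in for the video diffusion model are hypothetical placeholders for the real 3D latent diffusion, Gaussian rasterization, and 2D video diffusion stages.

```python
import numpy as np

def generate_3d_gaussians(condition, num_gaussians=1024, seed=0):
    """Stand-in for the 3D latent diffusion model: returns per-Gaussian
    parameters (position xyz, color rgb), nominally conditioned on inputs
    such as a road map or 3D bounding boxes (unused in this toy)."""
    rng = np.random.default_rng(seed)
    positions = rng.normal(size=(num_gaussians, 3))
    colors = rng.uniform(size=(num_gaussians, 3))
    return positions, colors

def render_low_res(positions, colors, camera_pose, res=(64, 64)):
    """Stand-in rasterizer: splat Gaussian centers into a coarse RGB image
    with a toy orthographic projection (camera_pose unused here)."""
    img = np.zeros((*res, 3))
    px = ((positions[:, 0] - positions[:, 0].min())
          / (np.ptp(positions[:, 0]) + 1e-8) * (res[1] - 1)).astype(int)
    py = ((positions[:, 1] - positions[:, 1].min())
          / (np.ptp(positions[:, 1]) + 1e-8) * (res[0] - 1)).astype(int)
    img[py, px] = colors
    return img

def refine_with_video_diffusion(frames):
    """Stand-in for the 2D video diffusion model: upsample each coarse
    frame 4x; the real model would synthesize appearance detail while
    staying consistent with the coarse 3D guidance."""
    return [np.kron(f, np.ones((4, 4, 1))) for f in frames]

# Generate the 3D scene once, render along a camera trajectory for
# geometric consistency, then refine the whole video in 2D.
pos, col = generate_3d_gaussians(condition={"road_map": None})
coarse = [render_low_res(pos, col, pose) for pose in range(3)]
video = refine_with_video_diffusion(coarse)
```

The key design point the sketch reflects: camera control lives entirely in the 3D stage (the trajectory determines the coarse renders), while the 2D stage only enhances appearance, so the refined video inherits the accurate camera motion.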