We present InfiniCube, a scalable method for generating unbounded dynamic 3D driving scenes with high fidelity and controllability. Previous scene-generation methods either suffer from limited scale or lack geometric and appearance consistency across generated sequences. In contrast, we leverage recent advances in scalable 3D representations and video models to generate large dynamic scenes with flexible control through HD maps, vehicle bounding boxes, and text descriptions. First, we construct a map-conditioned, sparse-voxel-based 3D generative model for unbounded voxel-world generation. Then, we repurpose a video model and ground it in the voxel world through a set of carefully designed pixel-aligned guidance buffers, synthesizing consistent appearance. Finally, we propose a fast feed-forward approach that employs both voxel and pixel branches to lift the dynamic videos to dynamic 3D Gaussians with controllable objects. Our method generates controllable and realistic 3D driving scenes, and extensive experiments validate its effectiveness and superiority.