Scalable generation of outdoor driving scenes requires 3D representations that remain consistent across multiple viewpoints and scale to large areas. Existing solutions either distill image or video generative models into 3D space, which harms geometric coherence and restricts rendering to the training views, or are limited to small-scale 3D scenes or object-centric generation. In this work, we propose a 3D generative framework based on the $Σ$-Voxfield grid, a discrete representation in which each occupied voxel stores a fixed number of colorized surface samples. To generate this representation, we train a semantic-conditioned diffusion model that operates on local voxel neighborhoods and uses 3D positional encodings to capture spatial structure. We scale to large scenes via progressive spatial outpainting over overlapping regions. Finally, we render the generated $Σ$-Voxfield grid with a deferred rendering module to obtain photorealistic images, enabling large-scale, multiview-consistent 3D scene generation without per-scene optimization. Extensive experiments show that our approach generates diverse large-scale urban outdoor scenes that can be rendered into photorealistic images under various sensor configurations and camera trajectories, at a moderate computational cost compared to existing approaches.
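To make the representation and the outpainting schedule concrete, the following is a minimal sketch and not the implementation used in this work. It assumes a sparse dictionary layout for the grid, a six-value sample format (local offset plus RGB), and a fixed-stride window schedule along one axis; the names `SparseSampleGrid` and `outpainting_windows`, and all parameter values, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

Coord = Tuple[int, int, int]
# Assumed sample format: (x, y, z) offset inside the voxel, then (r, g, b).
Sample = Tuple[float, float, float, float, float, float]


@dataclass
class SparseSampleGrid:
    """Hypothetical stand-in for the grid: only occupied voxels are stored."""

    samples_per_voxel: int  # the fixed sample count per voxel (value is an assumption)
    voxels: Dict[Coord, List[Sample]] = field(default_factory=dict)

    def set_voxel(self, coord: Coord, samples: List[Sample]) -> None:
        # Enforce the "fixed number of samples per occupied voxel" property.
        if len(samples) != self.samples_per_voxel:
            raise ValueError("each occupied voxel must hold exactly samples_per_voxel samples")
        self.voxels[coord] = list(samples)

    def is_occupied(self, coord: Coord) -> bool:
        return coord in self.voxels


def outpainting_windows(extent: int, window: int, overlap: int) -> Iterator[Tuple[int, int]]:
    """Yield half-open [start, stop) windows along one axis.

    Consecutive windows share `overlap` cells, so each new region can be
    generated conditioned on content that already exists. The last window
    is truncated at `extent` and may be shorter than `window`.
    """
    if extent <= 0:
        raise ValueError("extent must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    stride = window - overlap
    start = 0
    while True:
        stop = min(start + window, extent)
        yield (start, stop)
        if stop >= extent:
            break
        start += stride
```

In this sketch a generative model would be invoked once per window, keeping the overlapping cells fixed as context; how the actual model conditions on the overlap is not specified here.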