Controllable driving scene generation is critical for realistic and scalable autonomous driving simulation, yet existing approaches struggle to jointly achieve photorealism and precise control. We introduce HorizonForge, a unified framework that reconstructs scenes as editable Gaussian splats and meshes, enabling fine-grained 3D manipulation and language-driven vehicle insertion. Edits are rendered through a noise-aware video diffusion process that enforces spatial and temporal consistency, producing diverse scene variations in a single feed-forward pass without per-trajectory optimization. To standardize evaluation, we further propose HorizonSuite, a comprehensive benchmark spanning ego- and agent-level editing tasks such as trajectory modification and object manipulation. Extensive experiments show that the Gaussian-mesh representation delivers substantially higher fidelity than alternative 3D representations, and that temporal priors from video diffusion are essential for coherent synthesis. Combining these findings, HorizonForge establishes a simple yet powerful paradigm for photorealistic, controllable driving simulation, achieving an 83.4% user-preference gain and a 25.19% FID improvement over the second-best state-of-the-art method. Project page: https://horizonforge.github.io/.