HorizonWeaver: Generalizable Multi-Level Semantic Editing for Driving Scenes

Ensuring safety in autonomous driving requires scalable generation of realistic, controllable driving scenes beyond what real-world testing provides. Yet existing instruction-guided image editors, trained on object-centric or artistic data, struggle with dense, safety-critical driving layouts. We propose HorizonWeaver, which tackles three fundamental challenges in driving scene editing: (1) multi-level granularity, requiring coherent object- and scene-level edits in dense environments; (2) rich high-level semantics, preserving diverse objects while following detailed instructions; and (3) ubiquitous domain shifts, handling changes in climate, layout, and traffic across unseen environments. HorizonWeaver combines complementary contributions across data, model, and training: (1) Data: large-scale dataset generation, where we build a paired real/synthetic dataset from Boreas, nuScenes, and Argoverse2 to improve generalization; (2) Model: Language-Guided Masks for fine-grained editing, where semantics-enriched masks and prompts enable precise, language-guided edits; and (3) Training: content preservation and instruction alignment, where joint losses enforce scene consistency and instruction fidelity. Together, these components provide a scalable framework for photorealistic, instruction-driven editing of complex driving scenes. We collect 255K images across 13 editing categories. HorizonWeaver outperforms prior methods on L1, CLIP, and DINO metrics, achieves +46.4% user preference, and improves BEV segmentation IoU by +33%. Project page: https://msoroco.github.io/horizonweaver/
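
To make the idea of a joint objective concrete, the following is a minimal sketch of how a content-preservation term and an instruction-alignment term could be combined. It is not the loss used by HorizonWeaver, whose exact formulation the abstract does not give. The function names (`masked_l1`, `cosine_similarity`, `joint_loss`), the weights `w_preserve` and `w_align`, and the use of plain Python lists in place of image tensors and encoder embeddings are all illustrative assumptions.

```python
import math

def masked_l1(source, edited, edit_mask):
    """Mean absolute difference over pixels OUTSIDE the edit mask (mask == 0).

    Assumption: unedited regions should stay close to the source image.
    """
    total, count = 0.0, 0
    for s_row, e_row, m_row in zip(source, edited, edit_mask):
        for s, e, m in zip(s_row, e_row, m_row):
            if m == 0:
                total += abs(s - e)
                count += 1
    return total / count if count else 0.0

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def joint_loss(source, edited, edit_mask, image_emb, text_emb,
               w_preserve=1.0, w_align=1.0):
    """Weighted sum of a preservation term and an alignment term.

    Hypothetical formulation: image_emb and text_emb stand in for embeddings
    of the edited image and the instruction from some shared encoder.
    """
    preserve = masked_l1(source, edited, edit_mask)
    align = 1.0 - cosine_similarity(image_emb, text_emb)
    return w_preserve * preserve + w_align * align

# Toy 2x2 grayscale "images"; the mask marks the top-right pixel as edited.
source = [[0.0, 1.0], [0.5, 0.5]]
edited = [[0.0, 0.2], [0.5, 0.7]]
edit_mask = [[0, 1], [0, 0]]
loss = joint_loss(source, edited, edit_mask, [1.0, 0.0], [1.0, 0.0])
```

In this toy case the large change inside the mask is not penalized, while the small change outside it is, which is the behavior a content-preservation term is meant to encourage. The same masked L1 distance and embedding cosine similarity are also common shapes for evaluation metrics of the kind the abstract names, though the exact metric definitions used in the paper are not stated here.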
