Driving scene manipulation using real-world sensor data has emerged as a promising alternative to traditional driving simulators. Despite advances in language control and neural scene representations, existing methods treat grounding, editing, and simulation as loosely connected stages, relying on heuristic object localization, manual guidance, and single-agent validation, thereby constraining semantic expressiveness and hindering scalable, reactive scenario generation. We introduce SIMSplat, a driving scene editor built on scene-graph-based 4D Gaussian Splatting augmented with language-aligned features. By embedding appearance, motion, and location semantics directly into Gaussian scene-graph nodes, SIMSplat makes reconstructed scenes queryable through free-form natural language, bridging language understanding to object-level editing and multi-agent simulation within a single framework. Building on this language-grounded scene graph, SIMSplat supports diverse edits including fine-grained pedestrian manipulation, while a multi-agent path refinement module propagates changes across all agents to ensure reactive, physically plausible simulations. The pipeline further integrates with Vision-Language Models for automated scenario mining. Experiments show that SIMSplat more than doubles baseline grounding accuracy, achieves the highest task completion rate, and produces the lowest failure rates across diverse driving scenarios.
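As a rough illustration of what "queryable through free-form natural language" could mean for a scene graph with language-aligned node features, the sketch below ranks nodes by cosine similarity to a query embedding. It is not SIMSplat's implementation: the `SceneNode` class, the `ground_query` function, the node identifiers, and the 3-dimensional feature vectors are all hypothetical stand-ins, and a real system would presumably obtain both node and query features from a learned text-image encoder.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class SceneNode:
    """Hypothetical scene-graph node holding one language-aligned feature."""
    node_id: str
    feature: List[float]  # toy stand-in for a learned embedding


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def ground_query(nodes: List[SceneNode], query_feature: List[float],
                 top_k: int = 1) -> List[str]:
    """Return the ids of the top_k nodes most similar to the query feature."""
    ranked = sorted(nodes, key=lambda n: cosine(n.feature, query_feature),
                    reverse=True)
    return [n.node_id for n in ranked[:top_k]]


# Assumed toy data: three objects with made-up features.
nodes = [
    SceneNode("car_0", [0.9, 0.1, 0.0]),
    SceneNode("pedestrian_3", [0.1, 0.9, 0.2]),
    SceneNode("cyclist_1", [0.2, 0.3, 0.9]),
]
# Assumed embedding of a query such as "the person crossing the road".
query = [0.0, 1.0, 0.1]

print(ground_query(nodes, query))
```

In this toy setup the query vector lies closest to the `pedestrian_3` feature, so that node is returned first; the selected node id would then be the handle passed on to editing or simulation steps.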