Many multimodal learning tasks require supervision that remains consistent across edits, viewpoints, and scene-level interventions. However, such supervision is difficult to obtain from observation-level datasets, which do not expose the underlying scene state or how changes propagate through it. We present SceneForge, an intervention-driven framework that generates structured supervision from editable 3D world states. SceneForge represents each scene as a persistent world with semantic, geometric, and physical dependencies. By applying explicit interventions (e.g., object removal or camera variation) and propagating their effects through scene dependencies, SceneForge renders supervision that remains consistent with object structure and scene-level effects. This produces aligned outputs including counterfactual observations, multi-view observations, and effect-aware signals such as shadows and reflections, all derived from a shared world state rather than post hoc image-space processing. We instantiate SceneForge using Infinigen and Blender to construct a licensing-clean indoor supervision resource with a large number of counterfactual pairs and aligned annotations from over 2K scenes, covering both diverse single-view and registered multi-view settings. Under matched training budgets, incorporating SceneForge supervision improves both object removal and scene removal performance across multiple benchmarks in both quantitative and qualitative evaluation. These results indicate that modeling supervision as structured state transitions in editable worlds provides a practical and scalable foundation for intervention-consistent multimodal learning.
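The idea of deriving supervision from state transitions, rather than from image-space edits, can be illustrated with a toy sketch. This is not SceneForge's implementation and does not use the Infinigen or Blender APIs. The `World` class, the `effects` mapping, and the functions `remove_object`, `set_camera`, `render`, and `counterfactual_pair` are all hypothetical names introduced here. The `render` function is a stand-in that lists scene contents instead of producing an image, and dependencies are reduced to a single level (each effect has one causing object).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class World:
    """Toy persistent world state (hypothetical, for illustration only)."""
    objects: Set[str] = field(default_factory=set)
    # Maps an effect (e.g. a shadow or reflection) to the object causing it.
    effects: Dict[str, str] = field(default_factory=dict)
    camera: str = "cam0"


def remove_object(world: World, name: str) -> World:
    """Intervention: return a new world without `name` and without its effects."""
    if name not in world.objects:
        raise KeyError(name)
    return World(
        objects=world.objects - {name},
        # Propagate the removal: drop every effect caused by the removed object.
        effects={e: src for e, src in world.effects.items() if src != name},
        camera=world.camera,
    )


def set_camera(world: World, camera: str) -> World:
    """Intervention: return a copy of the world seen from another camera."""
    return World(set(world.objects), dict(world.effects), camera)


def render(world: World) -> Dict[str, object]:
    """Stand-in for a renderer: report what the camera would observe."""
    return {
        "camera": world.camera,
        "objects": sorted(world.objects),
        "effects": sorted(world.effects),
    }


def counterfactual_pair(world: World, name: str) -> Dict[str, object]:
    """Build a before/after pair plus the effects that the removal eliminated."""
    after = remove_object(world, name)
    removed_effects: List[str] = sorted(set(world.effects) - set(after.effects))
    return {
        "before": render(world),
        "after": render(after),
        "removed": name,
        "removed_effects": removed_effects,
    }


if __name__ == "__main__":
    world = World(
        objects={"chair", "table"},
        effects={"chair_shadow": "chair", "table_reflection": "table"},
    )
    print(counterfactual_pair(world, "chair"))
    print(render(set_camera(world, "cam1")))
```

Because both observations in a pair come from the same world state, the shadow of the removed chair disappears together with the chair, while the table's reflection is untouched. Each intervention returns a new `World` rather than mutating the original, so the pre-intervention state stays available for rendering additional views.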