Generative world models offer a compelling foundation for augmented-reality (AR) applications: by predicting future image sequences that incorporate deliberate visual edits, they produce temporally coherent, augmented future frames that can be computed ahead of time and cached, avoiding per-frame rendering from scratch in real time. In this work, we present SEGAR, a preliminary framework that combines a diffusion-based world model with a selective correction stage to support this vision. The world model generates augmented future frames in which specified regions are edited and the remaining regions are preserved. The correction stage then aligns safety-critical regions with real-world observations while leaving the intended augmentations elsewhere intact. We demonstrate this pipeline in driving scenarios, a representative setting in which semantic region structure is well defined and real-world feedback is readily available. We view this as an early step toward generative world models as practical AR infrastructure, where future frames can be generated, cached, and selectively corrected on demand.
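To make the generate, cache, and selectively correct flow concrete, the following is a minimal sketch and not the SEGAR implementation. The abstract does not specify how the correction stage works, so the sketch substitutes a plain per-pixel mask composite for it, and constant-valued arrays stand in for the world model's output. All names (`selective_correct`, `frame_cache`, `safety_mask`) are hypothetical.

```python
import numpy as np


def selective_correct(predicted, observed, safety_mask):
    """Overwrite safety-critical pixels of a cached frame with observed pixels.

    Assumption: a simple mask composite stands in for the correction stage,
    whose actual mechanism is not described in the abstract.

    predicted, observed: arrays of shape (H, W, 3)
    safety_mask: boolean array of shape (H, W), True where the real-world
                 observation must take precedence over the augmentation
    """
    if predicted.shape != observed.shape:
        raise ValueError("predicted and observed frames must have the same shape")
    if safety_mask.shape != predicted.shape[:2]:
        raise ValueError("safety_mask must match the frame height and width")
    # Add a trailing axis so the (H, W) mask broadcasts over the colour channels.
    mask = safety_mask[..., None]
    return np.where(mask, observed, predicted)


# Hypothetical cache of augmented future frames, keyed by future time step.
# Constant-valued frames stand in for world-model predictions.
frame_cache = {t: np.full((4, 4, 3), 0.8) for t in range(3)}

# A real observation arriving at time step 1 (again a constant placeholder).
observed = np.full((4, 4, 3), 0.2)

# Mark the bottom half of the frame as safety-critical (e.g. the road surface).
safety_mask = np.zeros((4, 4), dtype=bool)
safety_mask[2:, :] = True

corrected = selective_correct(frame_cache[1], observed, safety_mask)
```

In this toy setup the top half of `corrected` keeps the cached augmented values, while the bottom half takes the observed values. A learned correction stage would presumably blend the two more smoothly than a hard mask does.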