Text-driven synthesis of immersive 3D scenes is maturing rapidly, driven by new video generative models and feed-forward 3D reconstruction, with broad potential in AR/VR and world modeling. While panoramic images have proven effective for scene initialization, existing approaches trade visual fidelity against explorability: autoregressive expansion is prone to context drift, while panoramic video generation is limited to low resolution. We present Stepper, a unified framework for text-driven immersive 3D scene synthesis that circumvents these limitations via stepwise panoramic scene expansion. Stepper combines a novel multi-view 360° diffusion model, which enables consistent, high-resolution expansion, with a geometry reconstruction pipeline that enforces geometric coherence. Trained on a new large-scale, multi-view panorama dataset, Stepper achieves state-of-the-art fidelity and structural consistency, outperforming prior approaches and setting a new standard for immersive scene generation.