Automating immersive VR scene creation remains a primary research challenge. Existing methods typically rely on complex geometry with post-simplification, resulting in inefficient pipelines or limited realism. In this paper, we introduce ImmerseGen, a novel agent-guided framework for compact and photorealistic world generation that decouples realism from exhaustive geometric modeling. ImmerseGen represents scenes as hierarchical compositions of lightweight geometric proxies with synthesized RGBA textures, facilitating real-time rendering on mobile VR headsets. We propose terrain-conditioned texturing for base world generation, combined with context-aware texturing for scenery, to produce diverse and visually coherent worlds. VLM-based agents employ semantic grid-based analysis for precise asset placement and enrich scenes with multimodal enhancements such as visual dynamics and ambient sound. Experiments and real-time VR applications demonstrate that ImmerseGen achieves superior photorealism, spatial coherence, and rendering efficiency compared to existing methods.