Generating 3D scenes from natural language is highly desirable for digital content creation. However, existing methods are largely domain-restricted or reliant on predefined spatial relationships, limiting their capacity for unconstrained, open-vocabulary 3D scene synthesis. In this paper, we introduce SceneAssistant, a visual-feedback-driven agent designed for open-vocabulary 3D scene generation. Our framework couples a modern 3D object generation model with the spatial reasoning and planning capabilities of Vision-Language Models (VLMs). To enable open-vocabulary scene composition, we provide the VLMs with a comprehensive set of atomic operations (e.g., Scale, Rotate, FocusOn). At each interaction step, the VLM receives rendered visual feedback and acts accordingly, iteratively refining the scene toward more coherent spatial arrangements and closer alignment with the input text. Experiments show that our method generates diverse, open-vocabulary, and high-quality 3D scenes, and both qualitative analysis and quantitative human evaluations confirm its superiority over existing methods. Furthermore, our method allows users to instruct the agent to edit existing scenes through natural language commands. Our code is available at https://github.com/ROUJINN/SceneAssistant.
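To make the render-act loop concrete, below is a minimal Python sketch of the iterative refinement cycle the abstract describes. All names here (`Scene`, `render_scene`, `query_vlm`, the operation table) are hypothetical stand-ins for illustration, not the released SceneAssistant API; only the pattern, in which a VLM observes a rendering and dispatches atomic operations until it judges the scene complete, is taken from the text.

```python
import dataclasses
from typing import Callable

@dataclasses.dataclass
class Scene:
    """Minimal stand-in for a 3D scene: object id -> (scale, rotation_deg)."""
    objects: dict

# A hypothetical subset of the atomic operations exposed to the VLM.
def scale(scene: Scene, obj: str, factor: float) -> None:
    s, r = scene.objects[obj]
    scene.objects[obj] = (s * factor, r)

def rotate(scene: Scene, obj: str, degrees: float) -> None:
    s, r = scene.objects[obj]
    scene.objects[obj] = (s, (r + degrees) % 360)

OPS: dict[str, Callable] = {"Scale": scale, "Rotate": rotate}

def render_scene(scene: Scene) -> str:
    """Placeholder renderer; a real system would return an image."""
    return str(scene.objects)

def query_vlm(prompt: str, feedback: str) -> list[tuple]:
    """Placeholder VLM call: returns (op_name, obj, arg) actions,
    or an empty list once the rendering matches the prompt."""
    return []  # stub: a real VLM would plan actions from the rendering

def refine(scene: Scene, prompt: str, max_steps: int = 10) -> Scene:
    """Iteratively render, ask the VLM for actions, and apply them."""
    for _ in range(max_steps):
        actions = query_vlm(prompt, render_scene(scene))
        if not actions:  # VLM judges the scene complete
            break
        for op_name, obj, arg in actions:
            OPS[op_name](scene, obj, arg)
    return scene

if __name__ == "__main__":
    scene = Scene(objects={"chair": (1.0, 0.0), "table": (1.0, 0.0)})
    print(refine(scene, "a chair facing a table").objects)
```

Keeping the operations atomic is what makes the loop open-vocabulary: the VLM only ever composes a small fixed action set, while scene diversity comes from the object generator and the language prompt.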