This paper introduces SceneCraft, a Large Language Model (LLM) agent that converts text descriptions into Blender-executable Python scripts, which render complex scenes with up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among assets in the scene. SceneCraft then writes Python scripts based on this graph, translating relationships into numerical constraints for asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models such as GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, facilitating continuous self-improvement without expensive LLM parameter tuning. Our evaluation demonstrates that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as shown by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and by guiding a video generative model with generated scenes as an intermediary control signal.
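To make the step from scene graph to numerical constraints concrete, the following is a minimal sketch of how relationship edges might be scored against a candidate layout. It is an illustration only and not SceneCraft's actual code: the `Asset` class, the `proximity` and `left_of` scorers, the `max_dist` threshold, and the averaging in `score_layout` are all assumptions introduced here, and the sketch uses plain Python rather than the Blender API.

```python
import math
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical asset record: a name and a position in scene units."""
    name: str
    x: float
    y: float
    z: float = 0.0


def proximity(a: Asset, b: Asset, max_dist: float = 2.0) -> float:
    """Score 1.0 when a and b are within max_dist, decaying with distance otherwise."""
    d = math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))
    return 1.0 if d <= max_dist else max_dist / d


def left_of(a: Asset, b: Asset) -> float:
    """Score 1.0 when a has a smaller x coordinate than b, else 0.0."""
    return 1.0 if a.x < b.x else 0.0


# Assumed mapping from relation names (graph edge labels) to scoring functions.
RELATIONS = {"proximity": proximity, "left_of": left_of}


def score_layout(assets: dict, edges: list) -> float:
    """Average the constraint scores over all (relation, source, target) edges."""
    if not edges:
        return 1.0
    total = sum(RELATIONS[rel](assets[a], assets[b]) for rel, a, b in edges)
    return total / len(edges)


# A toy scene graph: nodes are assets, edges are spatial relationships.
assets = {
    "chair": Asset("chair", 0.0, 0.0),
    "table": Asset("table", 1.0, 0.0),
    "lamp": Asset("lamp", 5.0, 0.0),
}
edges = [
    ("proximity", "chair", "table"),
    ("left_of", "chair", "table"),
    ("proximity", "lamp", "table"),
]
score = score_layout(assets, edges)
```

In this toy layout the lamp sits too far from the table, so the overall score falls below 1.0. A refinement loop could use such a score, together with feedback on rendered images, to decide which asset positions to adjust.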