SceneCraft: An LLM Agent for Synthesizing 3D Scene as Blender Code

This paper introduces SceneCraft, a Large Language Model (LLM) agent that converts text descriptions into Blender-executable Python scripts, which render complex scenes containing up to a hundred 3D assets. This process requires complex spatial planning and arrangement. We tackle these challenges through a combination of advanced abstraction, strategic planning, and library learning. SceneCraft first models a scene graph as a blueprint, detailing the spatial relationships among the assets in the scene. SceneCraft then writes Python scripts based on this graph, translating the relationships into numerical constraints on asset layout. Next, SceneCraft leverages the perceptual strengths of vision-language foundation models such as GPT-V to analyze rendered images and iteratively refine the scene. On top of this process, SceneCraft features a library learning mechanism that compiles common script functions into a reusable library, enabling continuous self-improvement without expensive LLM parameter tuning. Our evaluation shows that SceneCraft surpasses existing LLM-based agents in rendering complex scenes, as demonstrated by its adherence to constraints and favorable human assessments. We also showcase the broader application potential of SceneCraft by reconstructing detailed 3D scenes from the Sintel movie and by guiding a video generative model with generated scenes as an intermediary control signal.
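
The abstract says SceneCraft translates scene-graph relationships into numerical constraints for asset layout, but it does not give the encoding. The following is a minimal sketch under stated assumptions: the relation names (`left_of`, `in_front_of`, `on_top_of`), the `gap` margin, the `Asset` fields, and every function name are hypothetical. The random-search `refine` loop is a stand-in chosen purely for illustration; it is not SceneCraft's refinement procedure, which per the abstract relies on a vision-language model (GPT-V) critiquing rendered images. The sketch shows how each graph edge can become a non-negative penalty that is zero exactly when the relation holds.

```python
import random
from dataclasses import dataclass


@dataclass
class Asset:
    """Hypothetical minimal asset record (not SceneCraft's data model)."""
    name: str
    x: float       # center along x
    y: float       # center along y
    z: float       # z-coordinate of the asset's bottom face
    height: float  # vertical extent of the asset


def penalty(relation: str, a: Asset, b: Asset, gap: float = 0.5) -> float:
    """Return 0.0 when `a relation b` holds, else a positive violation amount.

    Relation names and the `gap` margin are illustrative assumptions.
    """
    if relation == "left_of":
        # a's center should be at least `gap` units left of b's center along x
        return max(0.0, a.x + gap - b.x)
    if relation == "in_front_of":
        # assumed convention: smaller y means closer to the viewer
        return max(0.0, a.y + gap - b.y)
    if relation == "on_top_of":
        # only vertical contact is checked: a's bottom must touch b's top face
        # (a fuller version would also require horizontal overlap)
        return abs(a.z - (b.z + b.height))
    raise ValueError(f"unknown relation: {relation}")


def total_penalty(assets: dict[str, Asset], graph: list[tuple[str, str, str]]) -> float:
    """Sum violations over all (subject, relation, object) edges of the scene graph."""
    return sum(penalty(rel, assets[s], assets[o]) for s, rel, o in graph)


def refine(assets, graph, steps=2000, step_size=0.2, seed=0) -> float:
    """Random-search stand-in: keep a random nudge only if total penalty does not rise."""
    rng = random.Random(seed)
    best = total_penalty(assets, graph)
    names = list(assets)
    for _ in range(steps):
        asset = assets[rng.choice(names)]
        attr = rng.choice(("x", "y", "z"))
        old = getattr(asset, attr)
        setattr(asset, attr, old + rng.uniform(-step_size, step_size))
        new = total_penalty(assets, graph)
        if new <= best:
            best = new
        else:
            setattr(asset, attr, old)  # undo moves that make the layout worse
    return best


# Toy scene: a lamp that should sit on a table, and a chair left of the table.
assets = {
    "table": Asset("table", 0.0, 0.0, 0.0, 0.8),
    "lamp": Asset("lamp", 2.0, 0.0, 0.0, 0.5),
    "chair": Asset("chair", 0.0, 0.0, 0.0, 1.0),
}
graph = [("lamp", "on_top_of", "table"), ("chair", "left_of", "table")]
before = total_penalty(assets, graph)  # 0.8 (lamp not on table) + 0.5 (chair too far right)
after = refine(assets, graph)
```

Expressing each edge as a non-negative violation lets any optimizer, or an LLM reading the numbers, see which relations still fail. In a real Blender script the solved coordinates would be written to each object's `location`; this sketch omits `bpy` so that it runs outside Blender.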
