Diffusion-based image generation models such as DALL-E 3 and Stable Diffusion-XL demonstrate remarkable capabilities in generating images with realistic and unique compositions. Yet these models do not robustly reason about the physical and spatial configurations of objects, especially when instructed with unconventional, and thereby out-of-distribution, descriptions such as "a chair with five legs". In this paper, we propose a language agent with chain-of-3D-thoughts (L3GO), an inference-time approach that can reason about part-based 3D mesh generation of unconventional objects that current data-driven diffusion models struggle with. More concretely, we use large language models as agents to compose a desired object via trial and error within a 3D simulation environment. To facilitate our investigation, we develop a new benchmark, Unconventionally Feasible Objects (UFO), as well as SimpleBlenv, a wrapper environment built on top of Blender in which language agents can build and compose atomic building blocks via API calls. Human and automatic GPT-4V evaluations show that our approach surpasses standard GPT-4 and other language agents (e.g., ReAct and Reflexion) for 3D mesh generation on ShapeNet. Moreover, when tested on our UFO benchmark, our approach outperforms other state-of-the-art text-to-2D-image and text-to-3D models based on human evaluation.