Despite recent advancements in text-to-image generation, most existing methods struggle to create images with multiple objects and complex spatial relationships in the 3D world. To tackle this limitation, we introduce a generic AI system, namely MUSES, for 3D-controllable image generation from user queries. Specifically, our MUSES addresses this challenging task by developing a progressive workflow with three key components: (1) a Layout Manager for 2D-to-3D layout lifting, (2) a Model Engineer for 3D object acquisition and calibration, and (3) an Image Artist for 3D-to-2D image rendering. By mimicking the collaboration of human professionals, this multi-modal agent pipeline facilitates the effective and automatic creation of images with 3D-controllable objects, through an explainable integration of top-down planning and bottom-up generation. Additionally, we find that existing benchmarks lack detailed descriptions of the complex 3D spatial relationships among multiple objects. To fill this gap, we further construct a new benchmark, T2I-3DisBench (3D image scene), which describes diverse 3D image scenes with 50 detailed prompts. Extensive experiments show the state-of-the-art performance of MUSES on both T2I-CompBench and T2I-3DisBench, outperforming recent strong competitors such as DALL-E 3 and Stable Diffusion 3. These results demonstrate that MUSES takes a significant step forward in bridging natural language, 2D image generation, and the 3D world.