With the advent of diffusion-based generative models and their ability to generate text-conditioned images, content generation has received a major boost. Recently, these models have been shown to provide useful guidance for generating 3D graphics assets. However, existing work in text-conditioned 3D generation faces fundamental constraints: (i) the inability to generate detailed, multi-object scenes, (ii) the inability to textually control multi-object configurations, and (iii) the inability to produce physically realistic scene compositions. In this work, we propose CG3D, a method for compositionally generating scalable 3D assets that resolves these constraints. We find that explicit Gaussian radiance fields, parameterized to allow for compositions of objects, enable semantically and physically consistent scenes. By utilizing a guidance framework built around this explicit representation, we show state-of-the-art results that can even exceed the guiding diffusion model in terms of object combinations and physics accuracy.