Recent advances in large-scale text-to-image diffusion models have inspired significant breakthroughs in text-to-3D generation, which creates 3D content solely from a given text prompt. However, existing text-to-3D techniques lack a crucial ability in the creative process: interactively controlling and shaping the synthesized 3D content according to users' specifications (e.g., a sketch). To alleviate this issue, we present the first attempt at text-to-3D generation conditioned on an additional hand-drawn sketch, namely Control3D, which enhances controllability for users. In particular, a 2D conditioned diffusion model (ControlNet) is remoulded to guide the learning of a 3D scene parameterized as a NeRF, encouraging each view of the 3D scene to align with the given text prompt and hand-drawn sketch. Moreover, we exploit a pre-trained differentiable photo-to-sketch model to directly estimate the sketch of each image rendered from the synthetic 3D scene. The estimated sketch for each sampled view is further enforced to be geometrically consistent with the given sketch, in pursuit of more controllable text-to-3D generation. Through extensive experiments, we demonstrate that our proposal generates accurate and faithful 3D scenes that align closely with the input text prompts and sketches.
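The sketch-consistency idea can be illustrated with a toy example. The snippet below is a minimal sketch under stated assumptions, not the Control3D implementation: a Sobel edge filter stands in for the pre-trained photo-to-sketch model, a mean squared error stands in for the geometric consistency constraint, and the diffusion guidance term is passed in as a plain number. The names `estimate_sketch`, `sketch_consistency_loss`, `total_loss`, and `weight` are hypothetical.

```python
import math


def estimate_sketch(image):
    """Stand-in for the photo-to-sketch model (an assumption, not the paper's
    network): Sobel gradient magnitude, normalised to [0, 1]."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (image[y - 1][x + 1] + 2 * image[y][x + 1] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y][x - 1] - image[y + 1][x - 1])
            gy = (image[y + 1][x - 1] + 2 * image[y + 1][x] + image[y + 1][x + 1]
                  - image[y - 1][x - 1] - 2 * image[y - 1][x] - image[y - 1][x + 1])
            out[y][x] = math.hypot(gx, gy)
    peak = max(max(row) for row in out)
    if peak > 0:
        out = [[v / peak for v in row] for row in out]
    return out


def sketch_consistency_loss(rendered, target_sketch):
    """Mean squared error between the sketch estimated from a rendered view
    and the user-provided sketch (MSE is an illustrative choice)."""
    est = estimate_sketch(rendered)
    h, w = len(est), len(est[0])
    total = sum((est[y][x] - target_sketch[y][x]) ** 2
                for y in range(h) for x in range(w))
    return total / (h * w)


def total_loss(guidance_loss, rendered, target_sketch, weight=1.0):
    """Combine a (precomputed) diffusion guidance loss with the sketch term.
    `weight` is a hypothetical balancing coefficient."""
    return guidance_loss + weight * sketch_consistency_loss(rendered, target_sketch)


def square_image(size=8, lo=2, hi=6):
    """Toy grayscale 'rendering': a bright square on a dark background."""
    return [[1.0 if lo <= y < hi and lo <= x < hi else 0.0 for x in range(size)]
            for y in range(size)]


if __name__ == "__main__":
    target = estimate_sketch(square_image())       # pretend this is the user's sketch
    blank = [[0.0] * 8 for _ in range(8)]           # a rendering with no structure
    print(sketch_consistency_loss(square_image(), target))  # matching view
    print(sketch_consistency_loss(blank, target))           # mismatched view
```

In a real pipeline the rendered view would come from a differentiable NeRF renderer and the loss would be backpropagated to the scene parameters; here the functions only compute values, so the example shows the shape of the objective rather than the optimization itself.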