Recent innovations in text-to-3D generation have featured Score Distillation Sampling (SDS), which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However, current SDS-based models still struggle with intricate text prompts and commonly produce distorted 3D models with unrealistic textures or cross-view inconsistency issues. In this work, we introduce a novel Visual Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the visual appearance knowledge in a 2D visual prompt to boost text-to-3D generation. Instead of solely supervising SDS with the text prompt, VP3D first capitalizes on a 2D diffusion model to generate a high-quality image from the input text, which subsequently acts as a visual prompt to strengthen SDS optimization with explicit visual appearance. Meanwhile, we couple the SDS optimization with an additional differentiable reward function that encourages rendered images of 3D models to better align visually with the 2D visual prompt and match semantically with the text prompt. Through extensive experiments, we show that the 2D visual prompt in our VP3D significantly eases the learning of the visual appearance of 3D models and thus leads to higher visual fidelity with more detailed textures. Notably, when the self-generated visual prompt is replaced with a given reference image, VP3D is able to trigger a new task of stylized text-to-3D generation. Our project page is available at https://vp3d-cvpr24.github.io.
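To make the coupled objective concrete, below is a minimal, hypothetical sketch of an SDS-style update augmented with a reward term. All function names, the toy `noise_predictor`, the L2-based `reward`, and the timestep weighting are illustrative assumptions, not the paper's actual components; a real system would query a frozen pretrained 2D diffusion model and a learned reward model, with gradients flowing into the 3D model's parameters via autodiff.

```python
import numpy as np

rng = np.random.default_rng(0)

def noise_predictor(noisy_img, t, prompt_emb):
    # Hypothetical stand-in for a frozen diffusion model's noise
    # prediction eps_hat(x_t; y, t); a real pipeline would call a
    # pretrained text-to-image model here.
    return 0.1 * noisy_img + 0.01 * prompt_emb

def sds_grad(rendered, prompt_emb, t, alpha_bar):
    # Score Distillation Sampling: noise the rendered view, query
    # the diffusion model, and use w(t) * (eps_hat - eps) as the
    # gradient signal on the rendering.
    eps = rng.standard_normal(rendered.shape)
    noisy = np.sqrt(alpha_bar) * rendered + np.sqrt(1.0 - alpha_bar) * eps
    eps_hat = noise_predictor(noisy, t, prompt_emb)
    w = 1.0 - alpha_bar  # one common timestep-weighting choice
    return w * (eps_hat - eps)

def reward(rendered, visual_prompt):
    # Hypothetical differentiable reward: negative L2 distance to
    # the visual prompt (VP3D's reward is a learned alignment score).
    return -np.mean((rendered - visual_prompt) ** 2)

# Toy coupled step on an 8x8 "rendered view": follow the SDS
# gradient while tracking the reward against the visual prompt.
rendered = rng.standard_normal((8, 8))
visual_prompt = rng.standard_normal((8, 8))
prompt_emb = rng.standard_normal((8, 8))

g = sds_grad(rendered, prompt_emb, t=500, alpha_bar=0.5)
rendered = rendered - 0.1 * g  # SDS descent step
print(g.shape, float(reward(rendered, visual_prompt)) < 0.0)
```

The key point the sketch conveys is that the visual prompt enters twice: as conditioning context for the diffusion model's score and as the target of the reward term, which is what pushes rendered views toward the prompt's explicit appearance.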