3DStyle-Diffusion: Pursuing Fine-grained Text-driven 3D Stylization with 2D Diffusion Models

Text-driven stylization for 3D content creation is a fundamental challenge for the multimedia and graphics communities. Recent advances in cross-modal foundation models (e.g., CLIP) have made this problem tractable. Existing approaches commonly leverage CLIP to align the holistic semantics of the stylized mesh with the given text prompt. Nevertheless, it is not trivial to achieve more controllable stylization of fine-grained details in 3D meshes based solely on such semantic-level cross-modal supervision. In this work, we propose a new 3DStyle-Diffusion model that enables fine-grained stylization of 3D meshes with additional controllable appearance and geometric guidance from 2D Diffusion models. Technically, 3DStyle-Diffusion first parameterizes the texture of the 3D mesh into reflectance properties and scene lighting using implicit MLP networks. Meanwhile, an accurate depth map of each sampled view is obtained conditioned on the 3D mesh. Then, 3DStyle-Diffusion leverages a pre-trained controllable 2D Diffusion model to guide the learning of the rendered images, encouraging the synthesized image of each view to be semantically aligned with the text prompt and geometrically consistent with the depth map. This design integrates image rendering via implicit MLP networks and the diffusion process of image synthesis in an end-to-end fashion, enabling high-quality fine-grained stylization of 3D meshes. We also build a new dataset derived from Objaverse and an evaluation protocol for this task. Through both qualitative and quantitative experiments, we validate the capability of our 3DStyle-Diffusion. Source code and data are available at \url{https://github.com/yanghb22-fdu/3DStyle-Diffusion-Official}.
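To make the texture parameterization step more concrete, the following is a minimal NumPy sketch of an implicit MLP that maps surface points to reflectance properties. It is an illustration under stated assumptions, not the released implementation: the class name `ReflectanceMLP`, the layer sizes, the Fourier-feature encoding, the split into diffuse, roughness, and specular outputs, and the single directional light are all hypothetical choices. The weights are random and untrained, and the diffusion-based guidance described above is not shown.

```python
import numpy as np

def positional_encoding(x, num_freqs=4):
    """Map (N, 3) points to (N, 3 + 3*2*num_freqs) Fourier features."""
    feats = [x]
    for k in range(num_freqs):
        feats.append(np.sin((2.0 ** k) * np.pi * x))
        feats.append(np.cos((2.0 ** k) * np.pi * x))
    return np.concatenate(feats, axis=-1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ReflectanceMLP:
    """Hypothetical two-layer MLP: surface point -> (diffuse RGB, roughness, specular RGB)."""

    def __init__(self, num_freqs=4, hidden=32, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 3 + 3 * 2 * num_freqs
        self.num_freqs = num_freqs
        # Random, untrained weights; a real model would learn these.
        self.w1 = rng.normal(0.0, 0.5, size=(in_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.5, size=(hidden, 7))
        self.b2 = np.zeros(7)

    def __call__(self, points):
        encoded = positional_encoding(points, self.num_freqs)
        h = np.maximum(encoded @ self.w1 + self.b1, 0.0)  # ReLU hidden layer
        out = sigmoid(h @ self.w2 + self.b2)              # squash outputs into (0, 1)
        return out[:, 0:3], out[:, 3:4], out[:, 4:7]

def shade_diffuse(diffuse, normals, light_dir, light_rgb):
    """Lambertian term only (an assumption): albedo * light colour * max(n . l, 0)."""
    l = light_dir / np.linalg.norm(light_dir)
    cos_theta = np.clip(normals @ l, 0.0, None)[:, None]
    return diffuse * light_rgb * cos_theta

# Toy usage: five random points, with unit normals pointing away from the origin.
rng = np.random.default_rng(1)
points = rng.uniform(-1.0, 1.0, size=(5, 3))
normals = points / np.linalg.norm(points, axis=1, keepdims=True)

mlp = ReflectanceMLP()
diffuse, roughness, specular = mlp(points)
rgb = shade_diffuse(diffuse, normals, np.array([0.0, 0.0, 1.0]), np.array([1.0, 1.0, 1.0]))
```

In a full pipeline of the kind the abstract describes, the shaded image of each sampled view would be passed, together with that view's depth map, to the pre-trained controllable 2D Diffusion model, and the resulting guidance would update the MLP weights.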
