Points-to-3D: Bridging the Gap between Sparse Points and Shape-Controllable Text-to-3D Generation

Text-to-3D generation has recently attracted significant attention, fueled by 2D diffusion models trained on billions of image-text pairs. Existing methods primarily rely on score distillation, which uses 2D diffusion priors to supervise the generation of 3D models such as NeRF. However, score distillation is prone to view inconsistency, and implicit NeRF modeling can produce arbitrary shapes, which leads to less realistic and uncontrollable 3D generation. In this work, we propose Points-to-3D, a flexible framework that bridges the gap between sparse yet freely available 3D points and realistic, shape-controllable 3D generation by distilling knowledge from both 2D and 3D diffusion models. The core idea of Points-to-3D is to introduce controllable sparse 3D points to guide text-to-3D generation. Specifically, we use a sparse point cloud generated by the 3D diffusion model Point-E, conditioned on a single reference image, as the geometric prior. To better utilize the sparse 3D points, we propose an efficient point cloud guidance loss that adaptively drives the NeRF's geometry to align with the shape of the sparse 3D points. In addition to controlling the geometry, we optimize the NeRF for a more view-consistent appearance. Specifically, we perform score distillation with the publicly available 2D image diffusion model ControlNet, conditioned on the text as well as the depth map of the learned compact geometry. Qualitative and quantitative comparisons demonstrate that Points-to-3D improves view consistency and achieves good shape controllability for text-to-3D generation. Points-to-3D thus provides users with a new way to improve and control text-to-3D generation.
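To make the idea of a point cloud guidance loss more concrete, the following is a minimal sketch and not the formulation used in Points-to-3D, whose exact form the abstract does not give. It assumes a hypothetical `density_fn` that stands in for the NeRF density field, a hypothetical proximity threshold `radius`, and a binary cross-entropy between predicted occupancy and a pseudo-label derived from distance to the sparse points. All names and parameter values here are illustrative assumptions.

```python
import math

def nearest_distance(q, points):
    """Euclidean distance from query q to the closest point in the cloud."""
    return min(math.dist(q, p) for p in points)

def point_cloud_guidance_loss(density_fn, queries, points, radius=0.1, eps=1e-6):
    """Hypothetical guidance loss (an assumption, not the paper's definition):
    binary cross-entropy between predicted occupancy and a pseudo-label
    that marks a query as occupied when it lies near the sparse cloud."""
    total = 0.0
    for q in queries:
        # Pseudo-label: 1 if the query is within `radius` of any sparse point.
        target = 1.0 if nearest_distance(q, points) <= radius else 0.0
        # Map non-negative density to occupancy in [0, 1), then clamp for log.
        occ = 1.0 - math.exp(-max(density_fn(q), 0.0))
        occ = min(max(occ, eps), 1.0 - eps)
        total += -(target * math.log(occ) + (1.0 - target) * math.log(1.0 - occ))
    return total / len(queries)

# Toy data: two sparse points and three query locations (illustrative only).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
queries = [(0.0, 0.0, 0.05), (1.0, 0.05, 0.0), (3.0, 3.0, 3.0)]

def aligned_field(q):
    # Dense near the sparse points, empty elsewhere.
    return 10.0 if nearest_distance(q, points) <= 0.1 else 0.0

def misaligned_field(q):
    # Empty near the sparse points, dense elsewhere.
    return 0.0 if nearest_distance(q, points) <= 0.1 else 10.0

good_loss = point_cloud_guidance_loss(aligned_field, queries, points)
bad_loss = point_cloud_guidance_loss(misaligned_field, queries, points)
print(good_loss < bad_loss)
```

In this toy setup, a density field that agrees with the sparse points incurs a much lower loss than one that contradicts them, which is the behavior a geometry-alignment term needs. A real implementation would evaluate the loss on batched tensors with automatic differentiation rather than in a Python loop.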
