While 2D diffusion models generate realistic, high-detail images, 3D shape generation methods built on them, such as Score Distillation Sampling (SDS), produce cartoon-like, over-smoothed shapes. To help explain this discrepancy, we show that the image guidance used in Score Distillation can be understood as the velocity field of a 2D denoising generative process, up to the choice of a noise term. In particular, after a change of variables, SDS resembles a high-variance version of Denoising Diffusion Implicit Models (DDIM) with a differently sampled noise term: SDS draws fresh i.i.d. noise at each step, while DDIM infers it from the previous noise predictions. This excess variance can lead to over-smoothing and unrealistic outputs. We show that a better noise approximation can be recovered by inverting DDIM in each SDS update step. This modification makes SDS's generative process for 2D images almost identical to DDIM. In 3D, it removes over-smoothing, preserves higher-frequency detail, and brings the generation quality closer to that of 2D samplers. Experimentally, our method achieves 3D generation quality better than or similar to that of other state-of-the-art Score Distillation methods, without training additional neural networks or requiring multi-view supervision, and provides useful insights into the relationship between 2D and 3D asset generation with diffusion models.
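The contrast between fresh i.i.d. noise and noise inferred by inverting DDIM can be illustrated on a toy problem. The sketch below is not the authors' implementation: it assumes 1D Gaussian data, for which the optimal noise predictor has a closed form, and a hypothetical cosine schedule `alpha_bar`. The names `eps_model` and `ddim_move` are illustrative. It takes one deterministic DDIM step down in noise level, then returns to the original level in two ways: by a DDIM inversion step, and by re-noising the predicted clean sample with freshly drawn noise.

```python
import numpy as np

MU, S = 2.0, 0.5  # toy data distribution N(MU, S^2); an assumption for illustration


def alpha_bar(t):
    """Hypothetical schedule: cumulative signal level, decreasing in t on [0, 1)."""
    return np.cos(0.5 * np.pi * t) ** 2


def coeffs(t):
    """Signal and noise scales (sqrt(alpha_bar), sqrt(1 - alpha_bar)) at time t."""
    ab = alpha_bar(t)
    return np.sqrt(ab), np.sqrt(1.0 - ab)


def eps_model(x_t, t):
    """Closed-form E[eps | x_t] for Gaussian data; stands in for a trained network."""
    a, b = coeffs(t)
    return b * (x_t - a * MU) / (a**2 * S**2 + b**2)


def ddim_move(x, t_from, t_to):
    """Deterministic DDIM update from t_from to t_to.

    t_to < t_from is a denoising step; t_to > t_from is the standard
    DDIM inversion approximation (noise predicted at the lower level).
    """
    a_f, b_f = coeffs(t_from)
    a_to, b_to = coeffs(t_to)
    eps = eps_model(x, t_from)
    x0_hat = (x - b_f * eps) / a_f  # predicted clean sample
    return a_to * x0_hat + b_to * eps


rng = np.random.default_rng(0)
n = 1000
t, t_prev = 0.5, 0.45
a_t, b_t = coeffs(t)
a_p, b_p = coeffs(t_prev)

# Noisy samples at level t, then one DDIM denoising step to t_prev.
x_t = a_t * rng.normal(MU, S, n) + b_t * rng.standard_normal(n)
x_prev = ddim_move(x_t, t, t_prev)

# Option A: return to level t by DDIM inversion (noise inferred from the model).
x_inv = ddim_move(x_prev, t_prev, t)

# Option B: return to level t by re-noising the clean prediction with fresh noise.
x0_hat = (x_prev - b_p * eps_model(x_prev, t_prev)) / a_p
x_fresh = a_t * x0_hat + b_t * rng.standard_normal(n)

err_inv = np.mean(np.abs(x_inv - x_t))
err_fresh = np.mean(np.abs(x_fresh - x_t))
print(f"mean |x_inv - x_t|   = {err_inv:.4f}")
print(f"mean |x_fresh - x_t| = {err_fresh:.4f}")
```

In this toy setting the inversion step lands close to the original noisy sample, while fresh noise moves to an unrelated point at the same noise level. This is only meant to make the distinction concrete; it says nothing quantitative about image or 3D generation.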