Text-to-3D generation has achieved significant success by incorporating powerful 2D diffusion models, but insufficient 3D prior knowledge leads to inconsistent 3D geometry. Recently, with the release of large-scale multi-view datasets, fine-tuning the diffusion model on multi-view data has become a mainstream approach to the 3D inconsistency problem. However, this approach is confronted with fundamental difficulties stemming from the limited quality and diversity of 3D data compared with 2D data. To sidestep these trade-offs, we explore a retrieval-augmented approach tailored for score distillation, dubbed RetDream. We postulate that both the expressiveness of 2D diffusion models and the geometric consistency of 3D assets can be fully leveraged by employing semantically relevant assets directly within the optimization process. To this end, we introduce a novel framework for retrieval-based quality enhancement in text-to-3D generation. We leverage the retrieved asset to incorporate its geometric prior into the variational objective and to adapt the diffusion model's 2D prior toward view consistency, achieving drastic improvements in both the geometry and fidelity of generated scenes. We conduct extensive experiments to demonstrate that RetDream exhibits superior quality with increased geometric consistency. The project page is available at https://ku-cvlab.github.io/RetDream/.