Most text-to-3D generators build upon off-the-shelf text-to-image models trained on billions of images. They use variants of Score Distillation Sampling (SDS), which is slow, somewhat unstable, and prone to artifacts. One mitigation is to fine-tune the 2D generator to be multi-view aware, which can help distillation or can be combined with reconstruction networks to output 3D objects directly. In this paper, we further explore the design space of text-to-3D models. We significantly improve multi-view generation by using video generators instead of image generators. Combined with a 3D reconstruction algorithm that uses Gaussian splatting to optimize a robust image-based loss, this lets us produce high-quality 3D outputs directly from the generated views. Our new method, IM-3D, reduces the number of evaluations of the 2D generator network by 10-100x, resulting in a much more efficient pipeline, better quality, fewer geometric inconsistencies, and a higher yield of usable 3D assets.