Diffusion-based image generators can now produce high-quality and diverse samples, but their success has yet to fully translate to 3D generation: existing diffusion methods can generate either low-resolution but 3D-consistent outputs, or detailed 2D views of 3D objects that may have structural defects and lack view consistency or realism. We present HoloFusion, a method that combines the best of these approaches to produce high-fidelity, plausible, and diverse 3D samples while learning only from a collection of multi-view 2D images. The method first generates coarse 3D samples using a variant of the recently proposed HoloDiffusion generator. Then, it independently renders and upsamples a large number of views of the coarse 3D model, super-resolves them to add detail, and distills them into a single, high-fidelity implicit 3D representation, which also ensures view consistency of the final renders. The super-resolution network is trained end-to-end as an integral part of HoloFusion, and the final distillation uses a new sampling scheme to capture the space of super-resolved signals. We compare our method against existing baselines, including DreamFusion, Get3D, EG3D, and HoloDiffusion, and achieve, to the best of our knowledge, the most realistic results on the challenging CO3Dv2 dataset.