Large image diffusion models enable high-quality novel view synthesis with excellent zero-shot capability. However, because such models are based on image-to-image translation, they offer no guarantee of view consistency, which limits their performance on downstream tasks such as 3D reconstruction and image-to-3D generation. To improve consistency, we propose Consistent123, which synthesizes novel views simultaneously by incorporating additional cross-view attention layers and a shared self-attention mechanism. The proposed attention mechanism improves the interaction across all synthesized views, as well as the alignment between the condition view and the novel views. At sampling time, this architecture supports generating an arbitrary number of views simultaneously, even though it is trained at a fixed length. We also introduce a progressive classifier-free guidance strategy to balance texture and geometry in the synthesized object views. Qualitative and quantitative experiments show that Consistent123 outperforms baselines in view consistency by a large margin. Furthermore, we demonstrate significant improvements from Consistent123 on various downstream tasks, showing its great potential for 3D generation. The project page is available at consistent-123.github.io.
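The progressive classifier-free guidance strategy mentioned above can be illustrated with a minimal sketch. The abstract does not specify the schedule, so the linear ramp below, the function names `guidance_scale` and `cfg_combine`, and the parameters `w_start` and `w_end` are all assumptions made for illustration, not the authors' implementation. Only the combination rule itself (unconditional prediction plus a scaled difference toward the conditional prediction) is the standard classifier-free guidance formula.

```python
def guidance_scale(step, num_steps, w_start, w_end):
    """Guidance scale for one denoising step.

    ASSUMPTION: a linear ramp from w_start to w_end over the sampling
    trajectory. The actual schedule used by Consistent123 is not given
    in the abstract.
    """
    if num_steps < 2:
        return float(w_end)
    frac = step / (num_steps - 1)
    return w_start + frac * (w_end - w_start)


def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance on two noise predictions.

    Returns eps_uncond + w * (eps_cond - eps_uncond), element-wise.
    Plain Python lists stand in for the model's noise tensors.
    """
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Hypothetical usage: the scale changes at every step of a 5-step sampler.
scales = [guidance_scale(t, 5, w_start=1.0, w_end=3.0) for t in range(5)]
guided = cfg_combine([0.0, 1.0], [1.0, 1.0], scales[2])
```

Whether the scale should increase or decrease over the trajectory is left to the caller here through `w_start` and `w_end`, since the abstract only states that the strategy trades off texture against geometry.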