Interest in novel view synthesis has grown with Neural Radiance Field (NeRF) models, but these models scale poorly because they rely on precisely annotated multi-view images. Recent models address this by fine-tuning large text-to-image diffusion models on synthetic multi-view data. Despite robust zero-shot generalization, they may need post-processing and can face quality issues due to the synthetic-real domain gap. This paper introduces a novel pipeline for unsupervised training of a pose-conditioned diffusion model on single-category datasets. With the help of pretrained self-supervised Vision Transformers (DINOv2), we identify object poses by clustering the dataset according to the visibility and locations of specific object parts. The pose-conditioned diffusion model is trained on pose labels and, equipped with cross-frame attention at inference time, ensures cross-view consistency, which is further aided by our novel hard-attention guidance. Our model, MIRAGE, surpasses prior work in novel view synthesis on real images. Furthermore, MIRAGE is robust to diverse textures and geometries, as demonstrated by our experiments on synthetic images generated with pretrained Stable Diffusion.
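As a rough illustration of the cross-frame attention mentioned above, the sketch below shows the general idea in NumPy: queries computed from the view being generated attend to keys and values taken from a reference view, so that appearance is shared across views. This is a generic formulation and not necessarily how MIRAGE implements it; the function name `cross_frame_attention`, the single-head layout, and the tensor shapes are assumptions made for the example, and the hard-attention guidance is not modeled here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(q_target, k_ref, v_ref):
    """Single-head attention where target-view queries read from a reference view.

    q_target: (n_target_tokens, d) queries from the view being generated
    k_ref:    (n_ref_tokens, d)    keys from the reference view
    v_ref:    (n_ref_tokens, d_v)  values from the reference view
    (Hypothetical shapes chosen for illustration.)
    """
    d = q_target.shape[-1]
    scores = q_target @ k_ref.T / np.sqrt(d)   # (n_target_tokens, n_ref_tokens)
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ v_ref, weights

# Toy usage with random features standing in for diffusion-model activations.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(6, 8))
v = rng.normal(size=(6, 8))
out, w = cross_frame_attention(q, k, v)
```

In this sketch each output token is a convex combination of reference-view values, which is one way a generated view can be tied to the appearance of a source view.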