In this paper, we introduce a novel 3D-aware image generation method that leverages 2D diffusion models. We formulate 3D-aware image generation as the generation of a set of multiview 2D images, and further decompose it into a sequential process: unconditional generation of a first view followed by conditional generation of the remaining views. This formulation lets us use 2D diffusion models to strengthen the generative modeling power of the method. Additionally, we use depth information from monocular depth estimators to construct the training data for the conditional diffusion model from still images alone. We train our method on ImageNet, a large-scale dataset that previous methods have not addressed. The method produces high-quality images and significantly outperforms prior methods. Furthermore, it can generate instances with large view angles, even though the training images are diverse, unaligned, and gathered from "in-the-wild" real-world environments.
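The sequential unconditional-conditional formulation can be illustrated with a minimal sketch. It is not the paper's implementation: `generate_multiview`, `toy_unconditional`, and `toy_conditional` are hypothetical names, the two samplers are toy placeholders standing in for the unconditional and conditional diffusion models, and conditioning on only the most recent view is a simplifying assumption made here. The depth-based construction of the conditioning input is omitted.

```python
from typing import Callable, List, Sequence

import numpy as np

Image = np.ndarray


def generate_multiview(
    sample_first: Callable[[], Image],
    sample_next: Callable[[Image, float], Image],
    view_angles: Sequence[float],
) -> List[Image]:
    """Generate a multiview image set one view at a time.

    The first view comes from an unconditional sampler; every later view
    comes from a conditional sampler that sees the previous view and the
    requested change in view angle (an assumption of this sketch).
    """
    views = [sample_first()]
    for angle in view_angles:
        views.append(sample_next(views[-1], angle))
    return views


# Toy placeholders (hypothetical): in a real system these would be
# an unconditional and a conditional 2D diffusion model.
rng = np.random.default_rng(0)


def toy_unconditional() -> Image:
    # Stand-in for sampling a first view from scratch.
    return rng.standard_normal((8, 8, 3))


def toy_conditional(previous: Image, angle: float) -> Image:
    # Stand-in for conditional sampling: shift the previous view
    # horizontally to mimic a viewpoint change.
    return np.roll(previous, int(round(angle)), axis=1)


views = generate_multiview(toy_unconditional, toy_conditional, [1.0, 2.0, -3.0])
```

The loop structure is the point of the sketch: only the first call is unconditional, and each subsequent view depends on what has already been generated, which is what allows a 2D image model to produce a consistent set of views.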