Synthesizing novel views from a single input image is a challenging task: it requires extrapolating the 3D structure of a scene, inferring details in occluded regions, and maintaining geometric consistency across viewpoints. Many existing methods must fine-tune large diffusion backbones on multiple views or train a diffusion model from scratch, both of which are extremely expensive; they also suffer from blurry reconstructions and poor generalization. This gap motivates an explicit, lightweight view-translation framework that directly exploits the high-fidelity generative capability of a pretrained diffusion model while reconstructing a scene from a novel viewpoint. Given the DDIM-inverted latent of a single input image, we employ a camera-pose-conditioned translation U-Net, TUNet, to predict the inverted latent corresponding to the desired target view. However, an image sampled directly from the predicted latent can still be blurry. To address this, we propose a novel fusion strategy that exploits the noise correlation structure inherent in DDIM inversion and helps preserve texture and fine-grained detail. To synthesize the novel view, we use the fused latent as the initial condition for DDIM sampling, leveraging the generative prior of the pretrained diffusion model. Extensive experiments on MVImgNet demonstrate that our method outperforms existing methods.
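To make the inference pipeline concrete, the following is a minimal sketch, assuming a toy stand-in for TUNet, a placeholder convex-combination fusion (the paper's correlation-based rule is not reproduced here), and a generic `ddim_sample` callable for the frozen pretrained diffusion model; all module names, shapes, and hyperparameters below are illustrative assumptions rather than the paper's implementation.

```python
# Conceptual sketch: DDIM-inverted source latent -> TUNet translation -> fusion -> DDIM sampling.
import torch
import torch.nn as nn

class TUNet(nn.Module):
    """Toy stand-in for the camera-pose-conditioned translation U-Net (hypothetical architecture)."""
    def __init__(self, latent_ch=4, pose_dim=12, hidden=64):
        super().__init__()
        self.pose_mlp = nn.Sequential(nn.Linear(pose_dim, hidden), nn.SiLU(),
                                      nn.Linear(hidden, latent_ch))
        self.body = nn.Sequential(
            nn.Conv2d(latent_ch * 2, hidden, 3, padding=1), nn.SiLU(),
            nn.Conv2d(hidden, latent_ch, 3, padding=1))

    def forward(self, z_src, pose):
        # Broadcast the pose embedding over the spatial grid, then translate the latent.
        p = self.pose_mlp(pose)[:, :, None, None].expand_as(z_src)
        return self.body(torch.cat([z_src, p], dim=1))

def fuse(z_pred, z_src, alpha=0.7):
    """Placeholder fusion: blend the predicted target-view latent with the source
    inverted latent. The paper's strategy instead exploits the noise correlation
    structure of DDIM inversion; this blend only illustrates where fusion occurs."""
    return alpha * z_pred + (1.0 - alpha) * z_src

@torch.no_grad()
def novel_view(z_src_inv, rel_pose, tunet, ddim_sample):
    """z_src_inv: DDIM-inverted latent of the input view.
    ddim_sample: callable that runs DDIM sampling with the frozen diffusion prior."""
    z_tgt_inv = tunet(z_src_inv, rel_pose)   # predict the target-view inverted latent
    z_init = fuse(z_tgt_inv, z_src_inv)      # fuse to recover texture / fine detail
    return ddim_sample(z_init)               # decode with the pretrained generative prior

# Usage with random tensors and an identity "sampler" as placeholders.
tunet = TUNet()
z = torch.randn(1, 4, 64, 64)
pose = torch.randn(1, 12)
out = novel_view(z, pose, tunet, ddim_sample=lambda z0: z0)
print(out.shape)  # torch.Size([1, 4, 64, 64])
```

Note that only TUNet and the fusion step are trained or designed; the diffusion backbone stays frozen, which is what keeps the framework lightweight.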