Recent 3D-based novel view synthesis (NVS) methods are limited to single-object-centric scenes and struggle with complex environments. They typically require extensive 3D data for training and generalize poorly beyond the training distribution. Conversely, 3D-free methods can generate text-controlled views of complex, in-the-wild scenes using a pretrained Stable Diffusion model, without large amounts of 3D training data, but they lack camera control. In this paper, we introduce a method that generates camera-controlled viewpoints from a single input image by combining the benefits of 3D-free and 3D-based approaches. Our method handles complex and diverse scenes without extensive training or additional 3D and multiview data: it leverages widely available pretrained NVS models for weak guidance and integrates this knowledge into a 3D-free view synthesis approach. Experimental results demonstrate that our method outperforms existing models in both qualitative and quantitative evaluations, providing high-fidelity, consistent novel view synthesis at desired camera angles across a wide variety of scenes.
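The combination described above — a 3D-free diffusion model supplying scene content while a pretrained NVS model contributes weak camera-pose guidance — can be pictured as blending two noise predictions inside a diffusion denoising step. The sketch below is purely illustrative and is not the paper's actual algorithm: the function names, the linear interpolation of noise estimates, and the toy noise schedule are all assumptions made for exposition.

```python
import numpy as np

def blended_denoise_step(x_t, t, epsilon_free, epsilon_nvs, guidance_weight=0.3):
    """One hypothetical denoising step mixing two noise predictions.

    epsilon_free: noise predicted by a 3D-free text-to-image diffusion model
                  (e.g. Stable Diffusion) -- carries complex scene content.
    epsilon_nvs:  noise predicted by a pretrained camera-conditioned NVS
                  model -- supplies weak camera-pose guidance.
    guidance_weight: how strongly the NVS prediction steers the update.
    All names and the schedule below are illustrative, not the paper's API.
    """
    # Weakly steer the content model's prediction toward the NVS model's.
    eps = (1.0 - guidance_weight) * epsilon_free + guidance_weight * epsilon_nvs
    # Simplified DDIM-style x0 estimate with a toy linear schedule
    # (a real implementation would use the model's alpha_bar values).
    alpha = 1.0 - t
    x0_pred = (x_t - np.sqrt(1.0 - alpha) * eps) / np.sqrt(alpha)
    return x0_pred
```

In practice the guidance weight would trade off pose fidelity (from the NVS model) against visual quality on in-the-wild content (from the 3D-free model); the paper's actual integration mechanism may differ substantially from this linear blend.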