Generating multi-view images from a single input view using image-conditioned diffusion models is a recent advancement and has shown considerable potential. However, issues such as the lack of consistency in synthesized views and over-smoothing in extracted geometry persist. Previous methods integrate multi-view consistency modules or impose additional supervision to enhance view consistency, but at the cost of flexibility in camera positioning and versatility in view synthesis. In this study, we consider the radiance field optimized during geometry extraction as a more rigid consistency prior than the volume and ray aggregation used in previous works. We further identify and rectify a critical bias in the traditional radiance field optimization process through score distillation from a multi-view diffuser. We introduce Unbiased Score Distillation (USD), which utilizes unconditioned noise from a 2D diffusion model and greatly improves the fidelity of the radiance field. We leverage the rendered views from the optimized radiance field as the basis and develop a two-step specialization process for a 2D diffusion model, which is adept at conducting object-specific denoising and generating high-quality multi-view images. Finally, we recover faithful geometry and texture directly from the refined multi-view images. Empirical evaluations demonstrate that our optimized geometry and view distillation technique produces results comparable to those of state-of-the-art models trained on extensive datasets, all while maintaining freedom in camera positioning.