Generating multi-view images from text or single-image prompts is a critical capability for 3D content creation. Two fundamental questions on this topic are what data to use for training and how to ensure multi-view consistency. This paper introduces a novel framework that contributes to both questions. Rather than leveraging images from 2D diffusion models for training, we propose a dense, consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are better suited to multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video datasets used to train these models are abundant and diverse, which reduces the domain gap between training and fine-tuning. To enhance multi-view consistency, we introduce 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to obtain an explicit global 3D model, and then adopts a sampling strategy that feeds images rendered from this global 3D model back into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets, represented by 3D Gaussians, within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousands of GPU hours) with comparable visual quality and consistency. With further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual effects. Our project page is aigc3d.github.io/VideoMV.
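The idea behind 3D-Aware Denoising Sampling can be illustrated with a minimal toy sketch. It is not the actual implementation: `toy_denoiser` and `toy_reconstruct_and_render` are hypothetical stand-ins for the video diffusion model and for the feed-forward reconstruction plus rendering step. The deterministic DDIM-style update, the noise schedule, and the choice to apply the 3D step at every iteration are all assumptions made for illustration. The sketch only shows how a per-view clean-image estimate could be replaced by images rendered from a shared global model inside the sampling loop.

```python
import numpy as np

def toy_denoiser(x_t, alpha_t):
    # Hypothetical stand-in for the video diffusion model: predicts the clean
    # image x0. This is the closed-form posterior mean for unit-Gaussian data.
    return np.sqrt(alpha_t) * x_t

def toy_reconstruct_and_render(x0_views):
    # Hypothetical stand-in for feed-forward reconstruction + rendering:
    # the "global model" is just the mean over views, "rendered" back to
    # every view. A real system would fit and render 3D Gaussians instead.
    global_model = x0_views.mean(axis=0)
    return np.broadcast_to(global_model, x0_views.shape).copy()

def ddim_step(x_t, x0_hat, alpha_t, alpha_prev):
    # Deterministic DDIM-style update driven by a clean-image estimate.
    eps = (x_t - np.sqrt(alpha_t) * x0_hat) / np.sqrt(1.0 - alpha_t)
    return np.sqrt(alpha_prev) * x0_hat + np.sqrt(1.0 - alpha_prev) * eps

def sample_views(num_views=24, size=8, steps=10, use_3d=True, seed=0):
    rng = np.random.default_rng(seed)
    # Assumed schedule of cumulative signal levels, from least to most noisy.
    alphas = np.linspace(0.99, 0.01, steps)
    x = rng.standard_normal((num_views, size, size))  # start from pure noise
    for i in reversed(range(steps)):
        alpha_t = alphas[i]
        alpha_prev = alphas[i - 1] if i > 0 else 1.0
        x0_hat = toy_denoiser(x, alpha_t)
        if use_3d:
            # Swap the per-view estimate for renderings of the global model.
            x0_hat = toy_reconstruct_and_render(x0_hat)
        x = ddim_step(x, x0_hat, alpha_t, alpha_prev)
    return x

views = sample_views(use_3d=True)
baseline = sample_views(use_3d=False)
```

In this toy setting the views sampled with `use_3d=True` collapse onto the shared global model, whereas the baseline views remain independent. A real renderer would instead produce a different, pose-dependent image for each view.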