VideoMV: Consistent Multi-View Generation Based on Large Video Generative Model

Generating multi-view images from text or single-image prompts is a critical capability for creating 3D content. Two fundamental questions on this topic are what data to use for training and how to ensure multi-view consistency. This paper introduces a novel framework that contributes to both questions. Instead of leveraging images from 2D diffusion models for training, we propose a dense, consistent multi-view generation model that is fine-tuned from off-the-shelf video generative models. Images from video generative models are more suitable for multi-view generation because the underlying network architecture that generates them employs a temporal module to enforce frame consistency. Moreover, the video datasets used to train these models are abundant and diverse, which reduces the domain gap between training and fine-tuning. To enhance multi-view consistency, we introduce 3D-Aware Denoising Sampling, which first employs a feed-forward reconstruction module to obtain an explicit global 3D model, and then adopts a sampling strategy that feeds images rendered from the global 3D model back into the denoising sampling loop to improve the multi-view consistency of the final images. As a by-product, this module also provides a fast way to create 3D assets represented by 3D Gaussians within a few seconds. Our approach can generate 24 dense views and converges much faster in training than state-of-the-art approaches (4 GPU hours versus many thousand GPU hours) with comparable visual quality and consistency. With further fine-tuning, our approach outperforms existing state-of-the-art methods in both quantitative metrics and visual quality. Our project page is aigc3d.github.io/VideoMV.
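The sampling idea described in the abstract (reconstruct a global 3D model from the current per-view predictions, render it back to every view, and feed the renders into the denoising loop) can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: images are 1-D vectors, the "camera" for view v is a circular shift by v, `reconstruct` is a plain average standing in for the feed-forward reconstruction module, `toy_denoiser` is a hypothetical placeholder for the fine-tuned video model, and the update is a deterministic DDIM-style step in which the rendered views replace the predicted clean images at every step.

```python
import numpy as np

def render(global_model, num_views):
    # Toy "camera": view v sees the global model circularly shifted by v.
    return np.stack([np.roll(global_model, v) for v in range(num_views)])

def reconstruct(views):
    # Toy stand-in for feed-forward reconstruction:
    # undo each view's shift, then average into one global model.
    aligned = np.stack([np.roll(img, -v) for v, img in enumerate(views)])
    return aligned.mean(axis=0)

def toy_denoiser(x_t, alpha_bar_t):
    # Hypothetical placeholder for a network that predicts clean images.
    # It treats each view independently, so its output is not 3D-consistent.
    return np.tanh(x_t)

def sample_3d_aware(denoise_x0, x_T, alpha_bars):
    """Deterministic DDIM-style loop over all views at once.

    alpha_bars runs from noisy to clean; every entry except the last
    must be < 1, and the last entry should be 1.0.
    """
    x_t = x_T
    num_views = x_t.shape[0]
    for ab_t, ab_prev in zip(alpha_bars[:-1], alpha_bars[1:]):
        x0_pred = denoise_x0(x_t, ab_t)                   # per-view clean estimate
        x0_3d = render(reconstruct(x0_pred), num_views)   # consistent re-render
        # Noise implied by the current sample and the rendered clean images.
        eps = (x_t - np.sqrt(ab_t) * x0_3d) / np.sqrt(1.0 - ab_t)
        # Step to the next noise level using the rendered images.
        x_t = np.sqrt(ab_prev) * x0_3d + np.sqrt(1.0 - ab_prev) * eps
    return x_t

# Usage: 4 views of an 8-pixel "image", 10 sampling steps.
rng = np.random.default_rng(0)
views = sample_3d_aware(toy_denoiser,
                        rng.standard_normal((4, 8)),
                        np.linspace(0.01, 1.0, 11))
```

Because the last step uses a cumulative alpha of 1.0, the returned views are exactly the renders of a single global model, which is the consistency property the sampling strategy is after. In a real system the shift-and-average pair would be replaced by a learned reconstruction network and a 3D Gaussian renderer, and the schedule for when renders are injected is a design choice this sketch does not try to reproduce.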

