We introduce MVControl, a novel neural network architecture that extends existing pre-trained multi-view 2D diffusion models with additional input conditions, such as edge maps. Our approach enables controllable multi-view image generation and the generation of view-consistent 3D content. For controllable multi-view image generation, we use MVDream as the base model and train a new neural network module as a plug-in for end-to-end, task-specific condition learning. To precisely control the shapes and views of the generated images, we propose a conditioning mechanism that predicts an embedding encapsulating the input spatial and view conditions; this embedding is then injected into the network globally. Once MVControl is trained, 3D content can be generated by optimization with a score distillation sampling (SDS) loss, for which we propose a hybrid diffusion prior. The hybrid prior relies on a pre-trained Stable Diffusion network together with our trained MVControl for additional guidance. Extensive experiments demonstrate that our method generalizes robustly and enables controllable generation of high-quality 3D content.
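As a rough illustration of how a hybrid diffusion prior could enter SDS-based optimization, the sketch below combines two SDS residuals of the standard form w(t)(ε̂ − ε), one per prior, as a weighted sum. The abstract does not specify the combination rule, so the function names, the weights `lambda_sd` and `lambda_mv`, and the use of a single shared noise sample are all assumptions made for illustration. The noise predictions are toy arrays, not outputs of Stable Diffusion or MVControl.

```python
import numpy as np


def sds_residual(noise_pred, noise, weight):
    """SDS gradient w.r.t. the noised image: w(t) * (eps_hat - eps)."""
    return weight * (noise_pred - noise)


def hybrid_sds_grad(noise_pred_sd, noise_pred_mv, noise,
                    weight=1.0, lambda_sd=1.0, lambda_mv=1.0):
    """Weighted sum of two SDS residuals.

    Hypothetical combination rule: lambda_sd and lambda_mv are illustrative
    weights, not values or names taken from the source.
    """
    g_sd = sds_residual(noise_pred_sd, noise, weight)  # 2D prior term
    g_mv = sds_residual(noise_pred_mv, noise, weight)  # multi-view prior term
    return lambda_sd * g_sd + lambda_mv * g_mv


# Toy stand-ins for the two denoisers' predictions on 4 rendered views.
rng = np.random.default_rng(0)
noise = rng.standard_normal((4, 3, 8, 8))
pred_sd = noise + 0.1   # pretend output of the single-view prior
pred_mv = noise - 0.2   # pretend output of the multi-view prior

grad = hybrid_sds_grad(pred_sd, pred_mv, noise,
                       weight=0.5, lambda_sd=1.0, lambda_mv=2.0)
```

In an actual pipeline this gradient would be back-propagated through a differentiable renderer to the 3D representation's parameters, and the two priors might use different timesteps, noise samples, or latent spaces.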