We present DiffPortrait3D, a conditional diffusion model capable of synthesizing 3D-consistent, photo-realistic novel views from as few as a single in-the-wild portrait. Specifically, given a single RGB input, we aim to synthesize plausible yet consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and fine-tuning, our zero-shot method generalizes well to arbitrary face portraits with unposed camera views, extreme facial expressions, and diverse artistic depictions. At its core, we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone, while the denoising is guided by disentangled attentive control of appearance and camera pose. To achieve this, we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose from a condition image of a different subject captured from the same view. Furthermore, we insert a trainable cross-view attention module to enhance view consistency, which is further strengthened by a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks.
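The appearance injection into self-attention can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes one common realization, in which the target view's queries attend over the concatenation of target tokens and reference-image tokens, and it omits the learned query/key/value projections. The names `attention` and `self_attention_with_reference` are hypothetical, and plain Python lists stand in for feature maps.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def self_attention_with_reference(target_tokens, reference_tokens):
    """Assumed injection scheme: target queries attend to target + reference.

    Learned projections are omitted (identity), so keys and values are the
    raw tokens themselves.
    """
    keys = target_tokens + reference_tokens  # concatenate along the token axis
    return attention(target_tokens, keys, keys)


# Toy data: two target-view tokens and one reference-image token.
target = [[1.0, 0.0], [0.0, 1.0]]
reference = [[0.5, 0.5]]

plain = attention(target, target, target)                  # no reference
mixed = self_attention_with_reference(target, reference)   # with reference
```

In this toy setting, `mixed` differs from `plain` because each target token now also draws on the reference token, which is the intuition behind sharing appearance context across views.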