We present DiffPortrait3D, a conditional diffusion model that synthesizes 3D-consistent, photo-realistic novel views from as few as a single in-the-wild portrait. Specifically, given a single RGB input, we aim to synthesize plausible but consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and fine-tuning, our zero-shot method generalizes well to arbitrary face portraits with unposed camera views, extreme facial expressions, and diverse artistic depictions. At its core, we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone, while denoising is guided by disentangled attentive control of appearance and camera pose. To achieve this, we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose from a condition image of a different subject captured from the same view. Furthermore, we insert a trainable cross-view attention module to enhance view consistency, which is further strengthened by a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results, both qualitatively and quantitatively, on our challenging in-the-wild and multi-view benchmarks.
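The abstract does not spell out how appearance context is injected into the self-attention layers. As a rough illustration only, the sketch below shows one common way such injection can be done: the target tokens form the queries, while the keys and values are computed over the target tokens concatenated with the reference-image tokens. The function name `reference_self_attention`, the projection matrices `w_q`, `w_k`, `w_v`, and the concatenation scheme itself are assumptions made for this example, not details confirmed by the source.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_self_attention(target, reference, w_q, w_k, w_v):
    """Hypothetical sketch of appearance injection via self-attention.

    target:    (n_t, d) token features of the view being denoised.
    reference: (n_r, d) token features of the reference portrait.
    w_q, w_k, w_v: (d, d) projection matrices (assumed, illustrative).

    Returns the attended features (n_t, d) and the attention
    weights (n_t, n_t + n_r).
    """
    # Queries come only from the target view.
    q = target @ w_q
    # Assumption: keys and values see both target and reference tokens,
    # so each target token can copy appearance from the reference.
    context = np.concatenate([target, reference], axis=0)
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)
    return weights @ v, weights
```

In this sketch, passing a `reference` with zero rows reduces the computation to ordinary self-attention over the target tokens, which makes the role of the injected context easy to isolate.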