Portrait Fidelity Generation is a prominent research area in generative models, with a primary focus on enhancing both controllability and fidelity. Current methods struggle to generate high-fidelity portraits when faces occupy a small, low-resolution portion of the image, especially in multi-person group-photo settings. To tackle these issues, we propose a systematic solution called MagicID, built on a self-constructed million-scale multi-modal dataset named IDZoom. MagicID consists of a Multi-Mode Fusion training strategy (MMF) and a DDIM Inversion-based ID Restoration inference framework (DIIR). During training, MMF iteratively uses the skeleton and landmark modalities from IDZoom as conditional guidance. By introducing Clone Face Tuning in the training stage and Mask Guided Multi-ID Cross Attention (MGMICA) in the inference stage, we impose explicit constraints on facial position features for multi-ID group-photo generation. DIIR addresses the issue of facial artifacts: DDIM Inversion is used in conjunction with face landmarks and global and local face features to restore the face while keeping the background unchanged. Moreover, DIIR is plug-and-play and can be applied to any diffusion-based portrait generation method. To validate the effectiveness of MagicID, we conducted extensive comparative and ablation experiments. The results demonstrate that MagicID offers significant advantages on both subjective and objective metrics and achieves controllable generation in multi-person scenarios.
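DIIR builds on DDIM Inversion, which deterministically maps an image latent back to noise so it can be regenerated under identity conditions while the background is preserved. The following is a minimal generic sketch of the DDIM inversion/sampling round trip, not the paper's actual implementation: the noise schedule is a toy one, and `eps_model` is a placeholder standing in for the trained diffusion UNet.

```python
import math

# Toy decreasing noise schedule: alpha_bar[0] = 1 (clean) -> 0.1 (noisy).
T = 50
alpha_bar = [1.0 - 0.9 * t / T for t in range(T + 1)]

def eps_model(x, t):
    # Placeholder noise predictor; a real system would call the trained
    # diffusion UNet here (hypothetical stand-in for illustration).
    return [0.0] * len(x)

def ddim_invert(x0):
    """Deterministic DDIM inversion: map a clean latent toward noise."""
    x = list(x0)
    for t in range(T):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t)
        for i in range(len(x)):
            # Predict the clean sample, then step up one noise level.
            x0_pred = (x[i] - math.sqrt(1 - a_t) * eps[i]) / math.sqrt(a_t)
            x[i] = math.sqrt(a_next) * x0_pred + math.sqrt(1 - a_next) * eps[i]
    return x

def ddim_sample(xT):
    """Deterministic DDIM sampling: exact reverse of the inversion above."""
    x = list(xT)
    for t in reversed(range(T)):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, t + 1)
        for i in range(len(x)):
            x0_pred = (x[i] - math.sqrt(1 - a_next) * eps[i]) / math.sqrt(a_next)
            x[i] = math.sqrt(a_t) * x0_pred + math.sqrt(1 - a_t) * eps[i]
    return x
```

Because both passes are deterministic, inverting a latent and then sampling it back recovers the original, which is what lets DIIR edit the face region while leaving the background unchanged.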