Diffusion-Based Makeup Transfer with Facial Region-Aware Makeup Features

Current diffusion-based makeup transfer methods commonly condition generation on makeup information encoded by off-the-shelf foundation models (e.g., CLIP) to preserve the makeup style of the reference image. Although effective, these methods have two main limitations: (1) foundation models pre-trained for generic tasks struggle to capture makeup styles; (2) the makeup features of the reference image are injected into the diffusion denoising model as a whole for global makeup transfer, which overlooks facial region-aware makeup features (e.g., for the eyes and mouth) and limits controllability for region-specific makeup transfer. To address these limitations, we propose Facial Region-Aware Makeup features (FRAM), a method with two stages: (1) makeup CLIP fine-tuning; (2) identity and facial region-aware makeup injection. For makeup CLIP fine-tuning, unlike prior works that use off-the-shelf CLIP, we synthesize annotated makeup style data using GPT-o3 and a text-driven image editing model, and then use the data to train a makeup CLIP encoder through self-supervised and image-text contrastive learning. For identity and facial region-aware makeup injection, we construct before-and-after makeup image pairs from the edited images in stage 1 and use them to learn to inject the identity of the source image and the makeup of the reference image into the diffusion denoising model for makeup transfer. Specifically, we use learnable tokens to query the makeup CLIP encoder to extract facial region-aware makeup features for makeup injection, and apply an attention loss during training to enable regional control. For identity injection, we use a ControlNet Union to encode the source image and its 3D mesh simultaneously. Experimental results verify the superiority of our method in regional controllability and makeup transfer performance.
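To make the image-text contrastive objective mentioned for stage 1 more concrete, the following is a minimal sketch of the standard CLIP-style symmetric contrastive (InfoNCE) loss. The abstract does not specify the exact loss used for makeup CLIP fine-tuning, so this form is an assumption, as are the temperature value of 0.07, the toy embeddings, and the hypothetical function names `l2_normalize`, `cross_entropy_rows`, and `clip_contrastive_loss`.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cross_entropy_rows(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss.

    image_embs[i] and text_embs[i] are assumed to describe the same
    makeup style (a positive pair); all other pairings are negatives.
    The temperature is an assumed value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = cosine similarity of image i and text j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_rows(logits) + cross_entropy_rows(logits_t))


# Toy embeddings (hypothetical): two makeup images and their matching captions.
image_embs = [[1.0, 0.1], [0.1, 1.0]]
matched_text = [[0.9, 0.2], [0.2, 0.9]]
swapped_text = [matched_text[1], matched_text[0]]

loss_matched = clip_contrastive_loss(image_embs, matched_text)
loss_swapped = clip_contrastive_loss(image_embs, swapped_text)
print(f"matched pairs: {loss_matched:.4f}, swapped pairs: {loss_swapped:.4f}")
```

With correctly paired embeddings the loss is close to zero, while swapping the captions makes it much larger. This gap is the signal that pulls an image encoder toward the makeup style described in the paired text.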
