Recent advances in personalized image generation have significantly improved facial identity preservation, particularly in fields such as entertainment and social media. However, existing methods still struggle to achieve precise control over facial attributes in a per-subject-tuning-free (PSTF) way. Tuning-based techniques such as PreciseControl have shown promise by providing fine-grained control over facial features, but they often require substantial technical expertise and additional training data, limiting their accessibility. In contrast, PSTF approaches simplify the process by enabling image generation from a single facial input, but they lack precise control over facial attributes. In this paper, we introduce a novel PSTF method that enables both precise control over facial attributes and high-fidelity preservation of facial identity. Our approach uses a face recognition model to extract facial identity features, which are then mapped into the $W^+$ latent space of StyleGAN2 via the e4e encoder. We further enhance the model with a Triplet-Decoupled Cross-Attention module, which integrates facial identity features, attribute features, and text embeddings into the UNet architecture while ensuring a clean separation of identity and attribute information. Trained on the FFHQ dataset, our method generates personalized images with fine-grained control over facial attributes, without requiring additional fine-tuning or training data for individual identities. We demonstrate that our approach successfully balances personalization with precise facial attribute control, offering a more efficient and user-friendly solution for high-quality, adaptable facial image synthesis. The code is publicly available at https://github.com/UnicomAI/PSTF-AttControl.
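To illustrate the idea of decoupled conditioning, the following is a minimal NumPy sketch of how a triplet-decoupled cross-attention layer could combine the three condition streams (text, identity, attribute): a shared query projection attends over a separate key/value branch per stream, and the branch outputs are summed. All names, shapes, and the summation rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TripletDecoupledCrossAttention:
    """Hypothetical sketch: one query projection shared across three
    decoupled key/value branches (text, identity, attribute). Each
    stream gets its own K/V weights, so identity and attribute
    information are never mixed inside a single projection; the
    per-branch attention outputs are summed at the end."""

    def __init__(self, dim, ctx_dim, seed=0):
        rng = np.random.default_rng(seed)
        s = dim ** -0.5
        self.W_q = rng.normal(0.0, s, (dim, dim))
        # One (W_k, W_v) pair per condition stream.
        self.proj = {name: (rng.normal(0.0, s, (ctx_dim, dim)),
                            rng.normal(0.0, s, (ctx_dim, dim)))
                     for name in ("text", "identity", "attribute")}
        self.scale = dim ** -0.5

    def __call__(self, x, contexts):
        """x: (tokens, dim) UNet features; contexts: dict mapping each
        stream name to its (ctx_tokens, ctx_dim) embedding sequence."""
        q = x @ self.W_q
        out = np.zeros_like(x)
        for name, ctx in contexts.items():
            W_k, W_v = self.proj[name]
            k, v = ctx @ W_k, ctx @ W_v
            attn = softmax((q @ k.T) * self.scale, axis=-1)
            out += attn @ v  # accumulate the decoupled branches
        return out
```

Because each stream owns its projections, an attribute edit only changes the `attribute` context while the `identity` branch is untouched, which is one plausible way to realize the clean separation the abstract describes.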