Editing real facial images is a crucial task in computer vision with significant demand in real-world applications. While GAN-based methods have shown potential for manipulating images, especially when combined with CLIP, they are limited in their ability to reconstruct real images because GAN inversion remains difficult. Diffusion-based methods achieve successful image reconstruction, but effectively manipulating fine-grained facial attributes with textual instructions is still challenging. To address these issues and facilitate convenient manipulation of real facial images, we propose a novel approach that conducts text-driven image editing in the semantic latent space of a diffusion model. By aligning the temporal features of the diffusion model with the semantic condition during the generative process, we introduce a stable manipulation strategy that performs precise zero-shot manipulation effectively. Furthermore, we develop an interactive system named ChatFace, which leverages the zero-shot reasoning ability of large language models to perform efficient manipulations in the diffusion semantic latent space. This system enables users to perform complex multi-attribute manipulations through dialogue, opening up new possibilities for interactive image editing. Extensive experiments confirm that our approach outperforms previous methods and enables precise editing of real facial images, making it a promising candidate for real-world applications. Project page: https://dongxuyue.github.io/chatface/