Human face generation and editing are essential tasks in computer vision and the wider digital world. Recent studies have shown remarkable progress in multi-modal face generation and editing, for instance by using face segmentation to guide image generation. However, creating these conditioning modalities manually can be challenging for some users. We therefore introduce M3Face, a unified multi-modal, multilingual framework for controllable face generation and editing. The framework lets users start from text input alone: it automatically generates a controlling modality, such as a semantic segmentation or facial landmarks, and subsequently generates face images. We conduct extensive qualitative and quantitative experiments to showcase our framework's face generation and editing capabilities. Additionally, we propose the M3CelebA Dataset, a large-scale multi-modal and multilingual face dataset containing high-quality images, semantic segmentations, facial landmarks, and different captions for each image in multiple languages. The code and the dataset will be released upon publication.