Deep learning-based facial landmark detection for in-the-wild faces has improved significantly in recent years. However, landmark detection in other domains (e.g., cartoons and caricatures) remains challenging because extensively annotated training data are scarce. To address this, we design a two-stage training approach that leverages limited datasets and a pre-trained diffusion model to obtain aligned pairs of landmarks and faces in multiple domains. In the first stage, we train a landmark-conditioned face generation model on a large dataset of real faces. In the second stage, we fine-tune this model on a small dataset of image-landmark pairs, using text prompts to control the domain. These designs enable our method to generate high-quality synthetic paired datasets from multiple domains while preserving the alignment between landmarks and facial features. Finally, we fine-tune a pre-trained face landmark detection model on the synthetic dataset to achieve multi-domain face landmark detection. Our qualitative and quantitative results demonstrate that our method outperforms existing methods on multi-domain face landmark detection.
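As a rough illustration of how a synthetic paired dataset might be assembled once a landmark-conditioned, text-prompted generator is available, consider the minimal sketch below. It is not the actual implementation: `build_synthetic_pairs`, `SyntheticPair`, `stub_generator`, and the prompt strings are hypothetical names introduced here, and the stub merely marks landmark pixels on a blank canvas in place of a fine-tuned diffusion model. The point it shows is that each generated image is stored with the same landmarks that conditioned it, so the pair is aligned by construction.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Landmarks = List[Tuple[float, float]]  # (x, y) pixel coordinates
Image = List[List[int]]                # placeholder grayscale image (rows of pixels)


@dataclass
class SyntheticPair:
    image: Image
    landmarks: Landmarks
    domain: str


def build_synthetic_pairs(
    generate: Callable[[Landmarks, str], Image],
    landmark_sets: Sequence[Landmarks],
    domain_prompts: Dict[str, str],
) -> List[SyntheticPair]:
    """Generate one image per (domain, landmark set) and keep the conditioning
    landmarks as the annotation for that image."""
    pairs: List[SyntheticPair] = []
    for domain, prompt in domain_prompts.items():
        for landmarks in landmark_sets:
            image = generate(landmarks, prompt)
            pairs.append(SyntheticPair(image, list(landmarks), domain))
    return pairs


def stub_generator(landmarks: Landmarks, prompt: str, size: int = 8) -> Image:
    """Hypothetical stand-in for the generator: ignores the prompt and simply
    marks each landmark location on a blank canvas."""
    canvas = [[0] * size for _ in range(size)]
    for x, y in landmarks:
        col, row = int(round(x)), int(round(y))
        if 0 <= row < size and 0 <= col < size:
            canvas[row][col] = 255
    return canvas


# Assumed example inputs: one three-point landmark set and two domain prompts.
landmark_sets = [[(1.0, 2.0), (5.0, 2.0), (3.0, 5.0)]]
domain_prompts = {"cartoon": "a cartoon face", "caricature": "a caricature face"}
pairs = build_synthetic_pairs(stub_generator, landmark_sets, domain_prompts)
```

In a real pipeline, `generate` would wrap the fine-tuned generation model, and the resulting pairs would serve as training data for fine-tuning the landmark detector.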