Facial expression datasets remain limited in scale due to privacy concerns, the subjectivity of annotations, and the labor-intensive nature of data collection. This limitation poses a significant challenge for developing modern deep learning-based facial expression analysis models, particularly foundation models, which rely on large-scale data for optimal performance. To tackle this challenge, we introduce SynFER (Synthesis of Facial Expressions with Refined Control), a novel framework for synthesizing facial expression image data conditioned on high-level textual descriptions as well as finer-grained, more precise control through facial action units. To ensure the quality and reliability of the synthetic data, we propose a semantic guidance technique to steer the generation process and a pseudo-label generator to help rectify the facial expression labels of the synthetic images. To demonstrate the generation fidelity and the effectiveness of the synthetic data produced by SynFER, we conduct extensive representation learning experiments using both synthetic and real-world data. Experimental results validate the efficacy of the proposed approach and the synthetic data. Notably, our approach achieves 67.23% classification accuracy on AffectNet when training solely on synthetic data matching the AffectNet training set in size, which increases to 69.84% when scaling up to five times the original size. Our code will be made publicly available.