Existing 3D facial emotion modeling has been constrained by limited emotion classes and insufficient datasets. This paper introduces "Emo3D", an extensive "Text-Image-Expression dataset" spanning a wide spectrum of human emotions, each paired with images and 3D blendshapes. Leveraging Large Language Models (LLMs), we generate a diverse array of textual descriptions, facilitating the capture of a broad range of emotional expressions. Using this unique dataset, we conduct a comprehensive evaluation of fine-tuned language-based models and vision-language models such as Contrastive Language-Image Pretraining (CLIP) for 3D facial expression synthesis. We also introduce a new evaluation metric for this task to more directly measure the conveyed emotion. Our new metric, Emo3D, demonstrates its superiority over Mean Squared Error (MSE) in assessing visual-text alignment and semantic richness in 3D facial expressions associated with human emotions. "Emo3D" has great potential for applications in animation design, virtual reality, and emotional human-computer interaction.