Dynamic Facial Expression Recognition (DFER) is crucial for understanding human behavior. However, current methods exhibit limited performance, mainly due to the scarcity of high-quality data, the insufficient utilization of facial dynamics, and the ambiguity of expression semantics. To this end, we propose a novel framework, named Multi-modal Fine-grained CLIP for Dynamic Facial Expression Recognition with AdaptERs (FineCLIPER), which incorporates the following designs: 1) To better distinguish between similar facial expressions, we extend the class labels to textual descriptions from both positive and negative aspects, and obtain supervision by calculating the cross-modal similarity based on the CLIP model; 2) FineCLIPER mines useful cues from DFE videos hierarchically, at three semantic levels. Specifically, besides directly embedding video frames as input (low semantic level), we extract face segmentation masks and landmarks from each frame (middle semantic level) and use a Multi-modal Large Language Model (MLLM) with designed prompts to generate detailed descriptions of facial changes across frames (high semantic level). In addition, we adopt Parameter-Efficient Fine-Tuning (PEFT) to adapt the large pre-trained model (i.e., CLIP) to this task efficiently. FineCLIPER achieves state-of-the-art performance on the DFEW, FERV39k, and MAFW datasets in both supervised and zero-shot settings with few tunable parameters. Analysis and ablation studies further validate its effectiveness.
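
To make the first design concrete, the following is a minimal sketch of how supervision could be derived from positive and negative class descriptions through cross-modal similarity. It is an illustration under stated assumptions, not the FineCLIPER implementation: the abstract does not specify how the two similarities are combined, so the "positive minus negative" logit, the temperature of 0.07, the toy two-dimensional embeddings, and the function name `pos_neg_loss` are all hypothetical. In practice the embeddings would come from the CLIP video and text encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def pos_neg_loss(video_emb, pos_text_embs, neg_text_embs, label, temperature=0.07):
    """Cross-entropy over classes, using one positive and one negative
    description per class.

    ASSUMPTION: the logit for a class is sim(video, positive description)
    minus sim(video, negative description). The abstract only states that
    both kinds of description are used with CLIP similarity.
    """
    logits = [
        (cosine(video_emb, p) - cosine(video_emb, n)) / temperature
        for p, n in zip(pos_text_embs, neg_text_embs)
    ]
    probs = softmax(logits)
    return -math.log(probs[label]), probs


# Toy stand-ins for CLIP embeddings (hypothetical values, two classes).
video_emb = [1.0, 0.0]
pos_text_embs = [[1.0, 0.0], [0.0, 1.0]]  # e.g. "a person smiling", "a person frowning"
neg_text_embs = [[0.0, 1.0], [1.0, 0.0]]  # e.g. "a person not smiling", "a person not frowning"

loss, probs = pos_neg_loss(video_emb, pos_text_embs, neg_text_embs, label=0)
print(f"loss={loss:.6f}, probs={probs}")
```

In this toy setup the video embedding aligns with the positive description of class 0 and with the negative description of class 1, so nearly all probability mass falls on class 0 and the loss for `label=0` is close to zero.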