Facial expression recognition (FER) is an essential task for understanding human behavior. Facial expressions are among the most informative human behaviors, yet they are often compound and variable: different people may convey the same expression in very different ways. However, most FER methods still use one-hot or soft labels as supervision, which lack sufficient semantic descriptions of facial expressions and are less interpretable. Recently, contrastive vision-language pre-training (VLP) models (e.g., CLIP), which use text as supervision, have injected new vitality into various computer vision tasks by exploiting the rich semantics of text. In this work, we therefore propose CLIPER, a unified framework for both static and dynamic facial Expression Recognition based on CLIP. In addition, we introduce multiple expression text descriptors (METD) to learn fine-grained expression representations that make CLIPER more interpretable. We conduct extensive experiments on several popular FER benchmarks and achieve state-of-the-art performance, which demonstrates the effectiveness of CLIPER.