Open-Set Video-based Facial Expression Recognition with Human Expression-sensitive Prompting

In Video-based Facial Expression Recognition (V-FER), models are typically trained on closed-set datasets with a fixed number of known classes. However, these models struggle with the unknown classes that are common in real-world scenarios. In this paper, we introduce a challenging Open-set Video-based Facial Expression Recognition (OV-FER) task, which aims to identify both known and new, unseen facial expressions. While existing approaches use large-scale vision-language models such as CLIP to identify unseen classes, we argue that these methods may not adequately capture the subtle human expressions needed for OV-FER. To address this limitation, we propose a novel Human Expression-Sensitive Prompting (HESP) mechanism that significantly enhances CLIP's ability to model the details of facial expressions in video. HESP comprises three components: 1) a textual prompting module with learnable prompts to enhance CLIP's textual representation of both known and unknown emotions, 2) a visual prompting module that encodes temporal emotional information from video frames using expression-sensitive attention, equipping CLIP with a new visual modeling ability to extract emotion-rich information, and 3) an open-set multi-task learning scheme that promotes interaction between the textual and visual modules, improving the understanding of novel human emotions in video sequences. Extensive experiments on four OV-FER task settings demonstrate that HESP significantly boosts CLIP's performance (a relative improvement of 17.93% on AUROC and 106.18% on OSCR) and outperforms other state-of-the-art open-set video understanding methods by a large margin. Code is available at https://github.com/cosinehuang/HESP.
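
For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how AUROC and OSCR are commonly computed in open-set recognition, together with the arithmetic behind a "relative improvement". It is not taken from the HESP repository. It assumes each sample has a scalar confidence score (for example, the maximum class probability), that OSCR is the area under the curve of correct classification rate on known samples against false positive rate on unknown samples, and that all function names and the toy numbers are hypothetical.

```python
import numpy as np


def auroc(known_scores, unknown_scores):
    """Probability that a known sample outscores an unknown one (ties count 0.5)."""
    k = np.asarray(known_scores, dtype=float)[:, None]
    u = np.asarray(unknown_scores, dtype=float)[None, :]
    return float(((k > u) + 0.5 * (k == u)).mean())


def oscr(known_scores, known_correct, unknown_scores):
    """Area under the CCR-vs-FPR curve (assumed OSCR definition).

    CCR: fraction of known samples that are correctly classified AND
         have confidence at or above the threshold.
    FPR: fraction of unknown samples with confidence at or above the threshold.
    """
    k = np.asarray(known_scores, dtype=float)
    c = np.asarray(known_correct, dtype=bool)
    u = np.asarray(unknown_scores, dtype=float)

    # Sweep the threshold from the highest score to the lowest.
    thresholds = np.unique(np.concatenate([k, u]))[::-1]
    ccr, fpr = [0.0], [0.0]
    for t in thresholds:
        ccr.append(np.sum(c & (k >= t)) / len(k))
        fpr.append(np.sum(u >= t) / len(u))

    ccr, fpr = np.array(ccr), np.array(fpr)
    # Trapezoidal integration of CCR over FPR.
    return float(np.sum((fpr[1:] - fpr[:-1]) * (ccr[1:] + ccr[:-1]) / 2.0))


def relative_improvement(baseline, improved):
    """Relative gain in percent: 100 * (improved - baseline) / baseline."""
    return 100.0 * (improved - baseline) / baseline


if __name__ == "__main__":
    # Toy confidences, purely illustrative (not results from the paper).
    known = [0.9, 0.8, 0.7]
    correct = [True, True, True]
    unknown = [0.2, 0.1]
    print(auroc(known, unknown))
    print(oscr(known, correct, unknown))
    print(relative_improvement(50.0, 75.0))
```

Under this definition OSCR penalizes a model both for misclassifying known expressions and for being confident on unknown ones, whereas AUROC measures only how well known and unknown samples are separated. A relative improvement above 100%, as reported for OSCR, means the metric more than doubled relative to the baseline.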
