Facial expression spotting, the task of identifying the intervals in a video during which facial expressions occur, is a significant yet challenging problem in facial expression analysis. Two issues remain unresolved and hinder accurate spotting: irrelevant facial movements, and the difficulty of detecting the subtle motions of micro-expressions. In this paper, we propose an efficient framework for facial expression spotting. First, we propose a Sliding Window-based Multi-Resolution Optical flow (SW-MRO) feature, which computes multi-resolution optical flow of the input image sequence within compact sliding windows. The window length is chosen so that a window can cover a complete micro-expression while still distinguishing general macro-expressions from micro-expressions. SW-MRO effectively reveals subtle motions while avoiding the problems caused by severe head movement. Second, we propose SpotFormer, a multi-scale spatio-temporal Transformer that jointly encodes the spatio-temporal relationships of the SW-MRO features for accurate frame-level probability estimation. In SpotFormer, our proposed Facial Local Graph Pooling (FLGP) and convolutional layers are applied for multi-scale spatio-temporal feature extraction. We validate the SpotFormer architecture by comparing it with several model variants. Third, we introduce supervised contrastive learning into SpotFormer to enhance the discriminability between different types of expressions. Extensive experiments on SAMM-LV and CAS(ME)^2 show that our method outperforms state-of-the-art models, particularly in micro-expression spotting.
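
The supervised contrastive learning step can be illustrated with a minimal NumPy sketch. The abstract does not state which loss formulation is used, so the code below assumes the commonly used supervised contrastive loss, in which each anchor is pulled toward the other samples in the batch that share its label and pushed away from the rest. The function name `supervised_contrastive_loss`, the `temperature` value, and the label encoding (for example 0 for no expression, 1 for macro-expression, 2 for micro-expression) are hypothetical choices for illustration, not details taken from the paper.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over one batch (illustrative sketch).

    features: (N, D) array of embeddings, one row per sample.
    labels:   (N,) array of integer class labels (assumed encoding).
    Returns the mean loss over anchors that have at least one positive.
    """
    features = np.asarray(features, dtype=np.float64)
    labels = np.asarray(labels)
    n = features.shape[0]
    if n < 2:
        return 0.0  # no pairs to contrast

    # L2-normalise so the dot product is a cosine similarity.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    z = features / np.maximum(norms, 1e-12)

    # Pairwise similarities scaled by the temperature.
    logits = z @ z.T / temperature

    # An anchor is never contrasted with itself.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over all other samples in the batch.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Positives: same label as the anchor, excluding the anchor itself.
    pos_mask = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = pos_mask.sum(axis=1)
    valid = pos_count > 0
    if not valid.any():
        return 0.0

    # Average log-probability of the positives for each valid anchor.
    pos_log_prob = np.where(pos_mask, log_prob, 0.0).sum(axis=1)
    per_anchor = -pos_log_prob[valid] / pos_count[valid]
    return float(per_anchor.mean())


if __name__ == "__main__":
    # Two tight clusters; labels either agree with the clusters or cut across them.
    emb = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
    print(supervised_contrastive_loss(emb, np.array([0, 0, 1, 1])))
    print(supervised_contrastive_loss(emb, np.array([0, 1, 0, 1])))
```

In this toy example the loss is much smaller when the labels agree with the embedding clusters than when they cut across them, which is the behaviour a contrastive term is meant to encourage. How such a term would be weighted against the frame-level classification objective is not specified in the abstract and is left open here.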