Micro-expression recognition (MER) aims to recognize the brief, subtle facial movements in micro-expression (ME) video clips, which reveal genuine emotions. Most recent MER methods use only special frames from ME video clips, or optical flow extracted from those frames. In doing so, they neglect the relationship between facial movements and space-time, even though facial cues are hidden within these relationships. To address this issue, we propose Hierarchical Space-Time Attention (HSTA). Specifically, we first process ME video frames and special frames or data in parallel with our cascaded Unimodal Space-Time Attention (USTA) to establish connections between subtle facial movements and specific facial areas. Then, we design Crossmodal Space-Time Attention (CSTA) to achieve higher-quality fusion of crossmodal data. Finally, we hierarchically integrate USTA and CSTA to capture deeper facial cues. Our model emphasizes temporal modeling without neglecting the processing of special data, and it fuses content from different modalities while preserving the uniqueness of each. Extensive experiments on four benchmarks demonstrate the effectiveness of the proposed HSTA. In particular, compared with the latest method on the CASME3 dataset, it improves the score by about 3% in seven-category classification.
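The data flow described in the abstract (unimodal attention on each stream, then crossmodal fusion, repeated hierarchically) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the residual connections, the symmetric two-way cross-attention, the `depth` parameter, and the absence of learned projections and normalization are all assumptions made for illustration. Each stream is treated as a flat list of space-time tokens (for example, one vector per patch per frame).

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Plain scaled dot-product attention over lists of token vectors."""
    scale = math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def residual(x, y):
    """Element-wise sum of two equally shaped token lists."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(x, y)]


def unimodal_block(tokens):
    # Assumed stand-in for USTA: self-attention across all space-time
    # tokens of a single modality, with a residual connection.
    return residual(tokens, attention(tokens, tokens, tokens))


def crossmodal_block(target, source):
    # Assumed stand-in for CSTA: the target stream queries the source
    # stream; the residual keeps the target's own content.
    return residual(target, attention(target, source, source))


def hsta_sketch(video_tokens, special_tokens, depth=2):
    """Hypothetical hierarchy: alternate unimodal and crossmodal stages."""
    for _ in range(depth):
        video_tokens = unimodal_block(video_tokens)
        special_tokens = unimodal_block(special_tokens)
        fused_video = crossmodal_block(video_tokens, special_tokens)
        fused_special = crossmodal_block(special_tokens, video_tokens)
        video_tokens, special_tokens = fused_video, fused_special
    return video_tokens, special_tokens


# Toy input: 6 video tokens and 2 special-frame tokens, 4 features each.
video = [[0.1 * i, 0.2, -0.1 * i, 0.3] for i in range(6)]
special = [[0.5, -0.2 * i, 0.1, 0.0] for i in range(2)]
fused_video, fused_special = hsta_sketch(video, special)
```

Both streams keep their own token count and feature size after fusion, which is one simple way to read "fuses content from different modalities while preserving the uniqueness of each". A real model would add learned query, key, and value projections, multiple heads, and normalization.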