Previous methods for dynamic facial expression recognition (DFER) in the wild are mainly based on Convolutional Neural Networks (CNNs), whose local operations ignore long-range dependencies in videos. Transformer-based methods for DFER can achieve better performance, but they require more FLOPs and incur higher computational cost. To address both problems, we propose the local-global spatio-temporal Transformer (LOGO-Former), which captures discriminative features within each frame and models contextual relationships among frames while keeping complexity in balance. Based on the priors that facial muscles move locally and facial expressions change gradually, we first restrict both the space attention and the time attention to a local window to capture local interactions among feature tokens. We then perform global attention by iteratively querying a token with features from each local window to obtain long-range information about the whole video sequence. In addition, we propose a compact loss regularization term to further encourage the learned features to have minimum intra-class distance and maximum inter-class distance. Experiments on two in-the-wild dynamic facial expression datasets (i.e., DFEW and FERV39K) indicate that our method provides an effective way to exploit spatial and temporal dependencies for DFER.
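The local-then-global attention idea described above can be illustrated with a minimal sketch. This is not the LOGO-Former implementation: it assumes a 1D token sequence, non-overlapping windows, identity query/key/value projections, and mean pooling as the per-window summary used for the global step. The function names (`attend`, `local_window_attention`, `global_attention_over_windows`) are hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention of one query over a set of keys/values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def local_window_attention(tokens, window):
    """Each token attends only to the tokens inside its own window.

    Assumption: windows are non-overlapping and projections are identity.
    Cost is about n * window score computations instead of n * n.
    """
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        for query in block:
            out.append(attend(query, block, block))
    return out


def global_attention_over_windows(tokens, window):
    """Each token attends to one summary vector per window.

    Assumption: the summary is the mean of the window's tokens; the
    abstract does not specify how window features are aggregated.
    """
    summaries = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        dim = len(block[0])
        summaries.append([sum(t[d] for t in block) / len(block) for d in range(dim)])
    return [attend(query, summaries, summaries) for query in tokens]


# Usage: four 2-dimensional tokens split into two windows of two tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
local_out = local_window_attention(tokens, window=2)
global_out = global_attention_over_windows(tokens, window=2)
```

In this sketch the local step computes scores only within each window, while the global step lets every token see a compressed view of all windows, which is one plausible way to trade full attention for lower cost.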