Recognising continuous emotions and action unit (AU) intensities from face videos requires both spatial and temporal understanding of expression dynamics. Existing works primarily rely on 2D face appearances to extract such dynamics. This work focuses on a promising alternative based on parametric 3D face shape alignment models, which disentangle different factors of variation, including expression-induced shape variations. We aim to understand how expressive 3D face shapes are for estimating valence-arousal and AU intensities compared with state-of-the-art 2D appearance-based models. We benchmark four recent 3D face alignment models: ExpNet, 3DDFA-V2, DECA, and EMOCA. In valence-arousal estimation, expression features of 3D face models consistently surpass previous works, yielding average concordance correlations of 0.739 and 0.574 on the SEWA and AVEC 2019 CES corpora, respectively. We also study how 3D face shapes perform on AU intensity estimation on the BP4D and DISFA datasets, and find that 3D face features are on par with 2D appearance features for AUs 4, 6, 10, 12, and 25, but not across the entire set of AUs. To understand this discrepancy, we conduct a correspondence analysis between valence-arousal and AUs, which suggests that accurate prediction of valence-arousal may require knowledge of only a few AUs.
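For readers unfamiliar with the metric, the concordance correlation reported above can be illustrated with a minimal sketch of Lin's concordance correlation coefficient (CCC). This is the standard textbook definition, not code from this work; the function name `concordance_correlation` and the toy inputs are hypothetical, and whether the reported averages are taken over the valence and arousal dimensions is an assumption that the abstract does not confirm.

```python
from typing import Sequence


def concordance_correlation(pred: Sequence[float], gold: Sequence[float]) -> float:
    """Lin's concordance correlation coefficient between predictions and labels.

    Hypothetical helper for illustration; not taken from the paper's code.
    """
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("pred and gold must be non-empty and of equal length")

    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n

    # Population (divide-by-n) variances and covariance, as in Lin's definition.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n

    # The squared mean difference penalises a constant offset between the
    # two sequences, which plain Pearson correlation would ignore.
    denom = var_p + var_g + (mean_p - mean_g) ** 2
    if denom == 0.0:
        # Both sequences are constant and equal; the CCC is undefined here.
        return float("nan")
    return 2.0 * cov / denom


# Toy values for illustration only (not data from the paper).
identical = concordance_correlation([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
shifted = concordance_correlation([2.0, 3.0, 4.0], [1.0, 2.0, 3.0])
reversed_order = concordance_correlation([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])
```

Unlike Pearson correlation, the CCC drops below 1 when predictions are perfectly correlated with the labels but offset from them, which is why it is commonly used for continuous valence-arousal estimation.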