The COVID-19 pandemic has changed everyday norms and affected all aspects of our lives, especially social life. It has led people to wear medical face masks extensively in order to prevent transmission. This face occlusion can strongly impair the reading of emotion from the face, which urges us to incorporate the whole body into emotion recognition: the body must now play a more prominent role, despite its normally complementary nature. In this paper, we study the effect of face occlusion on emotion recognition performance and demonstrate the superiority of full-body input over the masked face alone. We use a deep learning model based on the Temporal Segment Network framework and aim to fully overcome the consequences of the face mask. Although single-stream RGB models can adapt and learn both facial and bodily features, doing so may cause the model to be confused by irrelevant information. By processing those features separately and fusing their preliminary prediction scores with a late fusion scheme, we take advantage of both modalities more effectively. This architecture can also naturally support temporal modeling by mixing information among neighboring segment frames. Experimental results suggest that spatial structure plays the more important role in emotional expression, while temporal structure is complementary.
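
As an illustration of the late fusion scheme mentioned above, the following is a minimal sketch and not the authors' implementation. It assumes each stream (face and body) emits one vector of class scores per video segment, that per-segment scores are averaged into a video-level score in the style of Temporal Segment Networks, and that the two streams are combined by a weighted average of their softmax probabilities. The function names, the `face_weight` parameter, and the example scores are all hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def segment_consensus(segment_logits: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-segment class scores into one video-level score vector."""
    n_segments = len(segment_logits)
    return [sum(column) / n_segments for column in zip(*segment_logits)]


def late_fusion(
    face_logits: Sequence[Sequence[float]],
    body_logits: Sequence[Sequence[float]],
    face_weight: float = 0.5,
) -> List[float]:
    """Fuse the two streams' predictions by a weighted average of probabilities.

    The weighted average is an assumed fusion rule chosen for illustration;
    the abstract only states that preliminary prediction scores are fused late.
    """
    face_probs = softmax(segment_consensus(face_logits))
    body_probs = softmax(segment_consensus(body_logits))
    return [
        face_weight * f + (1.0 - face_weight) * b
        for f, b in zip(face_probs, body_probs)
    ]


# Hypothetical scores: 3 segments, 4 emotion classes.
# The masked-face stream is nearly uninformative; the body stream favors class 1.
face = [[0.1, 0.0, 0.1, 0.0], [0.0, 0.1, 0.0, 0.1], [0.1, 0.1, 0.0, 0.0]]
body = [[0.0, 2.0, 0.1, 0.0], [0.1, 1.5, 0.0, 0.2], [0.0, 2.5, 0.2, 0.1]]

fused = late_fusion(face, body, face_weight=0.3)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(predicted, [round(p, 3) for p in fused])
```

Because each stream is reduced to probabilities before the streams are combined, a stream that carries little signal (here, the masked face) contributes a nearly flat distribution and leaves the decision largely to the other stream.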