Facial Action Unit (FAU) detection is a fine-grained classification problem that involves identifying the individual action units on the human face, as defined by the Facial Action Coding System. In this paper, we present a simple yet efficient Vision Transformer-based approach to Action Unit (AU) detection in the context of the Affective Behavior Analysis in-the-wild (ABAW) competition. We employ the Video Vision Transformer (ViViT) network to capture temporal facial changes in the video. In addition, to reduce the large size of the Vision Transformer model, we replace the ViViT feature extraction layers with a CNN backbone (RegNet). Our model outperforms the baseline model of the ABAW 2023 challenge by a notable margin of 14%. Furthermore, the achieved results are comparable to those of the top three teams in the previous ABAW 2022 challenge.
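The design described in the abstract, a CNN that extracts per-frame features followed by a transformer that models temporal change, can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the class names `FrameBackbone` and `CnnTemporalTransformer` are hypothetical, a tiny two-layer CNN stands in for RegNet, and the number of AUs (12), embedding size, clip length, depth, and per-frame output are all assumptions chosen only to keep the example small.

```python
import torch
import torch.nn as nn


class FrameBackbone(nn.Module):
    """Tiny stand-in CNN (assumption); the paper uses a RegNet backbone here."""

    def __init__(self, dim: int):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # pool each frame to a single feature vector
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, 3, H, W) -> (N, dim)
        return self.features(x).flatten(1)


class CnnTemporalTransformer(nn.Module):
    """CNN per-frame features + transformer encoder over the time axis."""

    def __init__(self, num_aus: int = 12, dim: int = 64,
                 num_frames: int = 8, depth: int = 2, heads: int = 4):
        super().__init__()
        self.backbone = FrameBackbone(dim)
        # Learned positional embedding, one vector per frame position.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_frames, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=2 * dim, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        # One logit per AU; AU detection is treated as multi-label.
        self.head = nn.Linear(dim, num_aus)

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (B, T, 3, H, W)
        b, t = clip.shape[:2]
        frames = clip.flatten(0, 1)                    # (B*T, 3, H, W)
        tokens = self.backbone(frames).view(b, t, -1)  # (B, T, dim)
        tokens = tokens + self.pos_embed[:, :t]
        tokens = self.temporal(tokens)                 # attention across frames
        return self.head(tokens)                       # (B, T, num_aus) logits
```

Under these assumptions, a clip of shape `(2, 8, 3, 64, 64)` yields logits of shape `(2, 8, 12)`; applying `torch.sigmoid` and a threshold to each logit gives an independent present/absent decision per AU and frame.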