In this paper, we present our approach to the 7th Affective Behavior Analysis in-the-Wild (ABAW) competition. The competition comprises three sub-challenges: Valence-Arousal (VA) estimation, Expression (Expr) classification, and Action Unit (AU) detection. To tackle these sub-challenges, we employ state-of-the-art models to extract powerful visual features. A Transformer Encoder then integrates these features for each of the VA, Expr, and AU sub-challenges. Because the extracted features differ in dimension, we introduce an affine module that aligns them to a common dimension before integration. Overall, our results significantly outperform the baselines.
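To make the alignment step concrete, here is a minimal sketch of one way an affine module could map features of different sizes to a common dimension. The abstract does not specify the module's exact form, so this assumes one affine map y = Wx + b per feature extractor. The extractor names, the feature dimensions, and `COMMON_DIM` are hypothetical, and the weights are random here rather than learned. The Transformer Encoder that would consume the aligned vectors is not shown.

```python
import random


def make_affine(in_dim, out_dim, rng):
    """Return (W, b) for y = W x + b, with W stored as out_dim rows of in_dim weights."""
    scale = 1.0 / in_dim ** 0.5
    W = [[rng.uniform(-scale, scale) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b


def apply_affine(params, x):
    """Apply y = W x + b to a single feature vector x."""
    W, b = params
    return [sum(w * v for w, v in zip(row, x)) + b_i for row, b_i in zip(W, b)]


def align_features(features, affines):
    """Project each extractor's feature vector to the common dimension."""
    return [apply_affine(affines[name], vec) for name, vec in features.items()]


rng = random.Random(0)

# Assumption: illustrative sizes only, not taken from the paper.
COMMON_DIM = 8
FEATURE_DIMS = {"extractor_a": 5, "extractor_b": 12, "extractor_c": 3}

# One affine map per extractor (randomly initialised in this sketch).
affines = {name: make_affine(dim, COMMON_DIM, rng) for name, dim in FEATURE_DIMS.items()}

# Stand-in features for a single frame, one vector per extractor.
frame_features = {name: [rng.random() for _ in range(dim)] for name, dim in FEATURE_DIMS.items()}

# After alignment every vector has length COMMON_DIM.
tokens = align_features(frame_features, affines)
```

Under these assumptions, `tokens` holds one vector of length `COMMON_DIM` per extractor, so inputs that originally had 5, 12, and 3 dimensions can be treated uniformly by a downstream encoder.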