Facial action unit (AU) detection is predominantly formulated as a supervised multi-label binary classification problem. Existing methods often encode pixel-level information about AUs, which places heavy demands on model complexity and expressiveness and, because AU labels are noisy, increases the risk of overfitting. In this study, we introduce a contrastive learning framework that combines supervised and self-supervised signals to learn discriminative features, departing from the conventional pixel-level learning paradigm in AU detection. To address noisy AU labels, we augment the supervised signal with a self-supervised signal through a positive sample sampling strategy that draws on three distinct types of positive sample pairs. To mitigate the imbalanced distribution of each AU type, we further apply an importance re-weighting strategy tailored to minority AUs. The resulting loss, denoted AUNCE, encapsulates this strategy. Experiments on two widely used benchmark datasets, BP4D and DISFA, show that our approach outperforms state-of-the-art AU detection methods.
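To make the idea of an importance-weighted contrastive objective concrete, the following is a minimal sketch of a generic InfoNCE-style loss with per-sample weights. It is not the AUNCE loss itself, whose exact form the abstract does not give. The function name `weighted_info_nce`, the `weights` argument, the `temperature` value, and the use of in-batch negatives are all assumptions made for illustration.

```python
import numpy as np


def weighted_info_nce(anchors, positives, weights, temperature=0.1):
    """Generic InfoNCE loss with per-sample importance weights (illustrative).

    anchors, positives: (N, D) arrays. Row i of `positives` is the positive
    for row i of `anchors`; every other row serves as an in-batch negative.
    weights: (N,) non-negative importance weights with a positive sum
    (hypothetical; e.g. larger for samples of a rarely activated AU).
    """
    # L2-normalise so the dot product is a cosine similarity.
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)

    logits = a @ p.T / temperature                       # (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability

    # Row-wise log-softmax; the diagonal holds the positive pairs.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    per_sample = -np.diag(log_prob)

    # Weighted average of the per-sample losses.
    w = np.asarray(weights, dtype=float)
    return float((w * per_sample).sum() / w.sum())


if __name__ == "__main__":
    feats = np.eye(4)
    uniform = np.ones(4)
    aligned = weighted_info_nce(feats, feats, uniform)
    shuffled = weighted_info_nce(feats, np.roll(feats, 1, axis=0), uniform)
    print(aligned < shuffled)
```

In this sketch, correctly matched pairs yield a lower loss than mismatched ones, and raising the weight of a sample increases its share of the averaged loss. That is one simple way to emphasise minority AUs, though the paper's own re-weighting scheme may differ.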