Contrastive learning has shown promise for learning robust representations from unlabeled data. However, constructing effective positive-negative pairs for contrastive learning on facial behavior datasets remains challenging. Such pairs inevitably encode subject-ID information, and because facial behavior datasets contain a limited number of subjects, randomly constructed pairs may push similar facial images apart. To address this issue, we propose to utilize activity descriptions, coarse-grained information that is provided in some datasets and conveys high-level semantic information about the image sequences, but that is often neglected in previous studies. More specifically, we introduce a two-stage Contrastive Learning with Text-Embedded framework for Facial behavior understanding (CLEF). The first stage is a weakly-supervised contrastive learning method that learns representations from positive-negative pairs constructed using coarse-grained activity information. The second stage trains the recognition of facial expressions or facial action units (AUs) by maximizing the similarity between each image and its corresponding text label names. The proposed CLEF achieves state-of-the-art performance on three in-the-lab datasets for AU recognition and three in-the-wild datasets for facial expression recognition.
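The two stages can be illustrated with a minimal sketch. The abstract does not give the exact loss functions, so the forms below are assumptions: stage one is written as a supervised-contrastive-style loss in which samples sharing an activity label are positives, and stage two as a softmax cross-entropy over cosine similarities between one image embedding and the embeddings of the label names. The function names, the `temperature` parameter, and the single-label setting are hypothetical choices for illustration; multi-label AU recognition would need a different stage-two objective.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def activity_contrastive_loss(embeddings, activity_ids, temperature=0.1):
    """Stage one (assumed form): samples with the same activity id are positives,
    all other samples in the batch are negatives."""
    z = [normalize(e) for e in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n)
                     if j != i and activity_ids[j] == activity_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        log_denom = log_sum_exp(list(logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    return total / counted if counted else 0.0


def image_text_loss(image_embedding, label_text_embeddings, target_index,
                    temperature=0.1):
    """Stage two (assumed form): cross-entropy over image-to-label-name
    cosine similarities, which rewards similarity to the correct label name."""
    img = normalize(image_embedding)
    logits = [dot(img, normalize(t)) / temperature
              for t in label_text_embeddings]
    return -(logits[target_index] - log_sum_exp(logits))


# Toy data: two clusters of 2-D embeddings, one per hypothetical activity.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = activity_contrastive_loss(embeddings, [0, 0, 1, 1])
mismatched_loss = activity_contrastive_loss(embeddings, [0, 1, 0, 1])

# Toy label-name embeddings standing in for a text encoder's output.
label_embeddings = [[1.0, 0.0], [0.0, 1.0]]
correct_loss = image_text_loss([1.0, 0.0], label_embeddings, 0)
wrong_loss = image_text_loss([1.0, 0.0], label_embeddings, 1)
```

In this toy setup the stage-one loss is lower when the activity labels agree with the embedding clusters than when they cut across them, and the stage-two loss is lower when the image embedding points toward the correct label-name embedding.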