Whilst deep learning techniques have achieved excellent emotion prediction performance, they still require large amounts of labelled training data, which are (a) onerous and tedious to compile, and (b) prone to errors and biases. We propose Multi-Task Contrastive Learning for Affect Representation (\textbf{MT-CLAR}) for few-shot affect inference. MT-CLAR combines multi-task learning with a Siamese network trained via contrastive learning to infer, from a pair of expressive facial images, (a) the (dis)similarity between the facial expressions, and (b) the difference in valence and arousal levels of the two faces. We further extend the image-based MT-CLAR framework to automated video labelling where, given one or a few labelled video frames (termed the \textit{support-set}), MT-CLAR labels the remainder of the video for valence and arousal. Experiments are performed on the AFEW-VA dataset with multiple support-set configurations; moreover, supervised learning on representations learnt via MT-CLAR is used for valence, arousal and categorical emotion prediction on the AffectNet and AFEW-VA datasets. The results show that valence and arousal predictions via MT-CLAR are comparable to the state of the art (SOTA), and that we significantly outperform SOTA with a support-set $\approx$6\% of the size of the video dataset.
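To make the support-set idea concrete, the following is a minimal sketch of how pairwise difference predictions could be turned into a label for an unlabelled frame. It is an illustration under stated assumptions, not the aggregation rule used by MT-CLAR: the function name `label_query_from_support`, the sign convention (difference = query minus support), and the simple averaging over support frames are all hypothetical choices, and the trained Siamese network is assumed to have already produced the differences.

```python
from statistics import fmean


def label_query_from_support(support_labels, predicted_diffs):
    """Estimate (valence, arousal) for one unlabelled query frame.

    support_labels:  list of (valence, arousal) tuples for the labelled
                     support frames.
    predicted_diffs: list of (d_valence, d_arousal) tuples, one per support
                     frame, assumed to be the model's predicted difference
                     "query minus support" (hypothetical sign convention).
    """
    if not support_labels or len(support_labels) != len(predicted_diffs):
        raise ValueError("need one predicted difference per support frame")

    # Each support frame yields one estimate: its known label plus the
    # predicted difference to the query. Averaging the estimates is an
    # assumed aggregation, chosen here only for simplicity.
    valence = fmean(s[0] + d[0] for s, d in zip(support_labels, predicted_diffs))
    arousal = fmean(s[1] + d[1] for s, d in zip(support_labels, predicted_diffs))
    return valence, arousal


# Two support frames with known labels and made-up predicted differences.
support = [(2.0, 1.0), (4.0, 3.0)]
diffs = [(1.0, 0.5), (-1.0, -1.5)]
valence, arousal = label_query_from_support(support, diffs)
```

With a single support frame the estimate reduces to that frame's label shifted by the predicted difference; with several, averaging is one plausible way to dampen the error of any individual pairwise prediction.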