This paper proposes a self-supervised approach to learning universal facial representations from videos that transfer across a variety of facial analysis tasks, such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder that learns highly robust and generic facial embeddings from abundantly available, non-annotated, web-crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from densely masked facial regions, which mainly include the eyes, nose, mouth, lips, and skin, to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate that MARLIN is an excellent facial video encoder and feature extractor that performs consistently well across tasks including FAR (1.13% gain over the supervised benchmark), FER (2.64% gain over the unsupervised benchmark), DFD (1.86% gain over the unsupervised benchmark), and LS (29.36% gain in Fréchet Inception Distance), and even in the low-data regime. Our code and models are available at https://github.com/ControlNet/MARLIN.
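To make the idea of densely masking facial regions concrete, the following is a minimal sketch and not MARLIN's actual implementation. It assumes the video has already been split into a (T, H, W) grid of spatio-temporal patches and that each patch carries an integer label (0 for background, positive for a facial region such as the eyes or mouth). The function name `dense_face_mask`, the label grid, and the 0.9 default mask ratio are all hypothetical choices made for illustration; the abstract does not specify them.

```python
import numpy as np

def dense_face_mask(region_ids, mask_ratio=0.9, seed=0):
    """Return a boolean mask (True = hidden) over a (T, H, W) patch grid.

    region_ids: integer array, 0 for background patches and a positive
    label for patches inside a facial region (assumed input format).
    Facial-region patches are hidden first; background patches are used
    only to fill whatever remains of the masking budget.
    """
    rng = np.random.default_rng(seed)
    flat = region_ids.ravel()
    n_mask = int(round(mask_ratio * flat.size))

    # Shuffle within each group so ties are broken randomly.
    face_idx = rng.permutation(np.flatnonzero(flat > 0))
    back_idx = rng.permutation(np.flatnonzero(flat == 0))
    order = np.concatenate([face_idx, back_idx])

    mask = np.zeros(flat.size, dtype=bool)
    mask[order[:n_mask]] = True
    return mask.reshape(region_ids.shape)

# Toy example: 2 frames of 4x4 patches with a 2x2 "face" block in each.
region_ids = np.zeros((2, 4, 4), dtype=int)
region_ids[:, 1:3, 1:3] = 1
mask = dense_face_mask(region_ids, mask_ratio=0.75)
visible = ~mask  # patches an encoder would see; the rest are reconstructed
```

In a masked autoencoder setup, only the `visible` patches would be passed to the encoder, and a decoder would be trained to reconstruct the hidden ones. Prioritizing facial-region patches is one simple way to force reconstruction of the eyes, nose, mouth, lips, and skin rather than the background.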