Face anti-spoofing (FAS) methods perform well in intra-domain setups, but their cross-domain performance remains unsatisfactory. Domain generalization methods have been used to align features from different domains extracted by a convolutional neural network (CNN) backbone, but the improvement is limited. Recently, Vision Transformer (ViT) models have performed well on various visual tasks. However, ViT models rely heavily on pre-training on large-scale datasets, a requirement that existing FAS datasets cannot satisfy. In this paper, taking the FAS task as an example, we propose the Masked Contrastive Autoencoder (MCAE) method, which addresses this problem using only limited data. In addition, so that the feature extractor captures features common to live samples from different domains, we combine Masked Image Modeling (MIM) with supervised contrastive learning to train our model. We summarize several design principles for MIM pre-training for downstream tasks, and we analyze our method from an information theory perspective. Experimental results show that our approach performs well on multiple public datasets and outperforms state-of-the-art methods.
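As a rough illustration of the two ingredients the abstract combines, the following minimal sketch shows MIM-style random patch masking and a supervised contrastive loss over feature vectors. It is not the authors' implementation: the function names, the 75% mask ratio, the temperature of 0.1, and the weighted-sum `combined_loss` are all assumptions made for illustration, since the abstract does not specify how the two objectives are combined.

```python
import math
import random


def random_patch_mask(num_patches, mask_ratio, rng):
    """Return sorted indices of patches hidden from the encoder (MIM-style masking)."""
    num_masked = int(num_patches * mask_ratio)
    return sorted(rng.sample(range(num_patches), num_masked))


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss: same-label features are pulled together,
    different-label features are pushed apart."""
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_sim = max(sims.values())
        log_denom = max_sim + math.log(
            sum(math.exp(s - max_sim) for s in sims.values())
        )
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def combined_loss(reconstruction_loss, contrastive_loss, weight=1.0):
    """Hypothetical weighted sum of the two objectives (an assumption)."""
    return reconstruction_loss + weight * contrastive_loss
```

In this sketch, features of live samples from different domains would share one label, so the contrastive term encourages the encoder to map them close together regardless of domain, while the masking step determines which patches a reconstruction objective would have to recover.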