Face anti-spoofing (FAS), or presentation attack detection, is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods generalize poorly to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task because they capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust cross-domain FAS that grounds visual representations in natural language. Specifically, we show that aligning the image representation with an ensemble of class descriptions (based on natural language semantics) improves FAS generalizability in low-data regimes. Finally, we propose a multimodal contrastive learning strategy to further boost feature generalization and bridge the gap between source and target domains. Extensive experiments on three standard protocols demonstrate that our method significantly outperforms state-of-the-art methods, achieving better zero-shot transfer performance than the five-shot transfer performance of adaptive ViTs. Code: https://github.com/koushiksrivats/FLIP
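The idea of aligning an image representation with an ensemble of class descriptions can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it assumes that text and image embeddings have already been produced by some vision-language encoder, and it replaces them with hypothetical 3-dimensional toy vectors. The function names, the prompt vectors, and the temperature value of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def ensemble_prototype(text_embeddings):
    """Average the normalized embeddings of several class descriptions
    and renormalize, giving one prototype vector per class."""
    normalized = [l2_normalize(e) for e in text_embeddings]
    dim = len(normalized[0])
    mean = [sum(e[i] for e in normalized) / len(normalized) for i in range(dim)]
    return l2_normalize(mean)


def class_probabilities(image_embedding, prototypes, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities between the
    image embedding and each class prototype."""
    image = l2_normalize(image_embedding)
    logits = {
        name: sum(a * b for a, b in zip(image, proto)) / temperature
        for name, proto in prototypes.items()
    }
    max_logit = max(logits.values())  # subtract the max for numerical stability
    exps = {name: math.exp(v - max_logit) for name, v in logits.items()}
    total = sum(exps.values())
    return {name: v / total for name, v in exps.items()}


# Hypothetical toy embeddings standing in for encoded class descriptions,
# e.g. several phrasings of "a real face" and of "a spoof face".
real_prompts = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.1]]
spoof_prompts = [[0.1, 0.9, 0.0], [0.0, 0.8, 0.2]]

prototypes = {
    "real": ensemble_prototype(real_prompts),
    "spoof": ensemble_prototype(spoof_prompts),
}

# Hypothetical image embedding that lies closer to the "real" prompts.
probs = class_probabilities([0.7, 0.3, 0.1], prototypes)
predicted = max(probs, key=probs.get)
```

In this sketch the prediction is the class whose prototype has the highest cosine similarity to the image embedding. In a training setting, the same similarity scores could serve as logits for a cross-entropy loss, which is one plausible way to realize the alignment the abstract describes.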