CLIPC8: Face liveness detection algorithm based on image-text pairs and contrastive learning

Face recognition is widely used in the financial sector, where systems must withstand many kinds of liveness attacks. Existing liveness detection algorithms are trained and tested on specific datasets, and their performance and robustness degrade when they are transferred to unseen datasets. To address this, we propose a face liveness detection method based on image-text pairs and contrastive learning. We divide liveness attacks in the financial field into eight categories and describe the images of each category with text. A text encoder and an image encoder extract feature vectors from the category description texts and the face images, respectively. By maximizing the similarity of positive (matching) image-text pairs and minimizing the similarity of negative (non-matching) pairs, the model learns a shared representation of images and text. The proposed method effectively detects scenario-specific liveness attacks, such as those carried out in dark environments or those involving tampered ID card photos, as well as traditional attacks such as printed-photo and screen-recapture attacks. Its zero-shot liveness detection performance on five public datasets (NUAA, CASIA-FASD, Replay-Attack, OULU-NPU, and MSU-MFSD) reaches the level of commercial algorithms. We further verified the detection capability of the proposed algorithm on five types of test datasets; the results show that it outperforms commercial algorithms, with detection rates reaching 100% on multiple datasets. These results demonstrate the effectiveness and robustness of introducing image-text pairs and contrastive learning into liveness detection.
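The training objective and the zero-shot decision rule described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes a CLIP-style symmetric cross-entropy loss over a batch in which the i-th image is paired with the i-th text, a temperature of 0.07, and cosine similarity as the similarity measure. The function names and the toy two-dimensional embeddings are hypothetical, and the encoders themselves are omitted.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(image_embs, text_embs):
    """Return sims[i][j] = cosine similarity between image i and text j."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]


def cross_entropy_rows(logits, targets):
    """Mean cross-entropy of each row of logits against its target index."""
    total = 0.0
    for row, target in zip(logits, targets):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[target]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Assumed CLIP-style loss: pair (i, i) is positive, all other pairs negative."""
    sims = cosine_similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    targets = list(range(len(logits)))
    image_to_text = cross_entropy_rows(logits, targets)
    text_to_image = cross_entropy_rows(logits_t, targets)
    return 0.5 * (image_to_text + text_to_image)


def zero_shot_predict(image_emb, class_text_embs):
    """Pick the category whose description embedding is most similar to the image."""
    sims = cosine_similarity_matrix([image_emb], class_text_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs (hypothetical values).
    images = [[1.0, 0.0], [0.0, 1.0]]
    texts = [[1.0, 0.0], [0.0, 1.0]]
    print(symmetric_contrastive_loss(images, texts))
    print(zero_shot_predict([0.9, 0.1], texts))
```

At inference time, under these assumptions, each of the eight category descriptions would be encoded once, and `zero_shot_predict` would return the index of the description closest to the face image. No retraining on the target dataset is needed, which is what makes a zero-shot evaluation possible.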
