SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models

Multimodal large language models (MLLMs) have demonstrated remarkable problem-solving capabilities in various vision fields (e.g., generic object recognition and grounding), building on strong visual semantic representation and language reasoning ability. However, whether MLLMs are sensitive to subtle visual spoof/forged clues, and how they perform in face attack detection (e.g., face spoofing and forgery detection), remains unexplored. In this paper, we introduce a new benchmark, SHIELD, to evaluate the ability of MLLMs on face spoofing and forgery detection. Specifically, we design true/false and multiple-choice questions to evaluate multimodal face data in these two face security tasks. For the face anti-spoofing task, we evaluate three different modalities (i.e., RGB, infrared, depth) under four types of presentation attacks (i.e., print attack, replay attack, rigid mask, paper mask). For the face forgery detection task, we evaluate GAN-based and diffusion-based data with both visual and acoustic modalities. Each question is tested in both zero-shot and few-shot settings under standard and chain-of-thought (COT) prompting. The results indicate that MLLMs hold substantial potential in the face security domain, offering advantages over traditional task-specific models in terms of interpretability, flexible multimodal reasoning, and joint face spoof and forgery detection. Additionally, we develop a novel Multi-Attribute Chain of Thought (MA-COT) paradigm for describing and judging various task-specific and task-irrelevant attributes of face images, which provides rich task-related knowledge for subtle spoof/forged clue mining. Extensive experiments on separate face anti-spoofing, separate face forgery detection, and joint detection tasks demonstrate the effectiveness of the proposed MA-COT. The project is available at https://github.com/laiyingxin2/SHIELD
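
As a rough illustration of the evaluation protocol described above, the following sketch enumerates the settings under which each question is tested: two question formats, two shot settings, and two prompting styles. All identifiers (`QUESTION_TYPES`, `SHOT_SETTINGS`, `PROMPT_STYLES`, `evaluation_settings`) are hypothetical names chosen for this example and are not taken from the SHIELD repository; the sketch assumes the three factors are fully crossed, which the abstract suggests but does not state explicitly.

```python
from itertools import product

# Factors named in the abstract; the identifiers themselves are assumptions.
QUESTION_TYPES = ("true_false", "multiple_choice")
SHOT_SETTINGS = ("zero_shot", "few_shot")
PROMPT_STYLES = ("standard", "cot")

# Face anti-spoofing factors listed in the abstract (kept for reference).
FAS_MODALITIES = ("rgb", "infrared", "depth")
FAS_ATTACKS = ("print", "replay", "rigid_mask", "paper_mask")


def evaluation_settings():
    """Return every combination of question type, shot setting, and prompt style."""
    return [
        {"question_type": q, "shots": s, "prompt_style": p}
        for q, s, p in product(QUESTION_TYPES, SHOT_SETTINGS, PROMPT_STYLES)
    ]


if __name__ == "__main__":
    for setting in evaluation_settings():
        print(setting)
```

Under the full-crossing assumption this yields eight settings per task; the modality and attack-type tuples could be crossed in the same way to index the face anti-spoofing data.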
