Anti-spoofing detection has become a necessity for face recognition systems because of the security threat posed by spoofing attacks. Despite their great success against traditional attacks, most deep-learning-based methods perform poorly against 3D masks, which closely simulate real faces in both appearance and structure. These methods generalize poorly because they focus only on the spatial domain with single-frame input. The recent introduction of rPPG (remote photoplethysmography), a biomedical technology, has mitigated this problem. However, rPPG-based methods are sensitive to noise and require at least one second (> 25 frames) of observation time, which induces high computational overhead. To address these challenges, we propose a novel 3D mask detection framework called FASTEN (Flow-Attention-based Spatio-Temporal aggrEgation Network). We tailor the network to focus more on fine-grained details in large movements, which eliminates redundant spatio-temporal feature interference and quickly captures the splicing traces of 3D masks in fewer frames. The proposed network contains three key modules: 1) a facial optical flow network that obtains non-RGB inter-frame flow information; 2) flow attention that assigns a different significance to each frame; and 3) spatio-temporal aggregation that combines high-level spatial features with temporal transition features. Extensive experiments show that FASTEN requires only five frames of input and outperforms eight competitors in both intra-dataset and cross-dataset evaluations on multiple detection metrics. Moreover, FASTEN has been deployed on real-world mobile devices for practical 3D mask detection.
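To make the roles of the flow attention and aggregation modules concrete, the toy sketch below weights per-frame features by motion strength and sums them. It is an illustrative assumption, not the FASTEN architecture: the abstract does not specify how attention scores are computed, so the softmax over mean flow magnitudes, the function names `flow_attention_weights` and `aggregate`, and the sample numbers are all hypothetical.

```python
import math


def flow_attention_weights(flow_magnitudes):
    """Turn per-frame mean optical-flow magnitudes into attention weights.

    Assumption: a plain softmax, so frames with larger motion get
    larger weights. The real scoring function is not given in the abstract.
    """
    peak = max(flow_magnitudes)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in flow_magnitudes]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(frame_features, weights):
    """Attention-weighted sum of per-frame feature vectors."""
    dim = len(frame_features[0])
    return [
        sum(w * feat[i] for w, feat in zip(weights, frame_features))
        for i in range(dim)
    ]


# Hypothetical input: five frames, each with a 3-dimensional feature vector
# and one scalar motion score (made-up values for illustration only).
features = [
    [0.2, 0.1, 0.0],
    [0.3, 0.0, 0.1],
    [0.9, 0.8, 0.7],
    [0.1, 0.2, 0.1],
    [0.0, 0.1, 0.2],
]
motion = [0.1, 0.5, 2.0, 0.3, 0.2]

weights = flow_attention_weights(motion)
clip_feature = aggregate(features, weights)
print(weights)
print(clip_feature)
```

In this sketch the third frame has the largest motion score, so it contributes most to the aggregated clip feature. A learned attention module would replace the fixed softmax, and a classifier would consume `clip_feature`.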