Deepfakes pose a major security risk to biometric authentication. The technology produces realistic fake videos that impersonate real people, fooling systems that rely on facial features and voice patterns for identification. Existing multimodal deepfake detectors rely on conventional fusion methods, such as majority rule and ensemble voting, which often struggle to adapt to changing data characteristics and complex patterns. In this paper, we introduce the Straight-through Gumbel-Softmax (STGS) framework, a comprehensive approach to searching multimodal fusion model architectures. Using a two-level search, the framework jointly optimizes the network architecture, its parameters, and classification performance. At the first level, salient features are efficiently selected from the backbone networks; within the cell structure, a weighted fusion operation then integrates information from the different modalities. The architecture that maximizes classification performance is derived by varying parameters such as the softmax temperature and the number of sampling iterations. Experiments on the FakeAVCeleb and SWAN-DF datasets demonstrate an AUC of 94.4\% achieved with minimal model parameters.
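The abstract names the Straight-through Gumbel-Softmax as the core sampling mechanism for the architecture search. A minimal NumPy sketch of that estimator follows; the function names, the temperature value, and the three-way candidate-operation choice are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def sample_gumbel(shape, rng, eps=1e-20):
    """Draw Gumbel(0, 1) noise via the inverse-CDF trick."""
    u = rng.uniform(size=shape)
    return -np.log(-np.log(u + eps) + eps)

def straight_through_gumbel_softmax(logits, tau=1.0, rng=None):
    """One straight-through Gumbel-Softmax sample over discrete choices.

    Returns (hard, soft): `hard` is the discrete one-hot selection used
    in the forward pass; `soft` is the relaxed distribution through which
    gradients would flow in the backward pass (the straight-through
    estimator substitutes d(soft)/d(logits) for d(hard)/d(logits)).
    Lower `tau` makes `soft` closer to one-hot.
    """
    rng = rng or np.random.default_rng()
    g = sample_gumbel(logits.shape, rng)
    soft = np.exp((logits + g) / tau)
    soft /= soft.sum()
    hard = np.zeros_like(soft)
    hard[np.argmax(soft)] = 1.0
    return hard, soft

# Hypothetical example: choose one of three candidate fusion operations.
logits = np.array([1.0, 0.5, -0.2])
hard, soft = straight_through_gumbel_softmax(
    logits, tau=0.5, rng=np.random.default_rng(0))
```

Because `argmax` is non-differentiable, the straight-through trick is what lets a discrete architecture choice (e.g., which fusion operation a cell uses) be trained end-to-end with gradient descent.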