As deepfakes generated by advanced generative models pose increasingly serious threats, existing audiovisual deepfake detection approaches struggle to generalize to unseen forgeries. We propose Referee, a novel reference-aware audiovisual deepfake detection method. Referee leverages speaker-specific cues from only a one-shot reference example to detect manipulations beyond spatiotemporal artifacts. By matching and aligning identity-related queries from reference and target content within cross-modal features, Referee jointly reasons about audiovisual synchrony and identity consistency. Extensive experiments on FakeAVCeleb, FaceForensics++, and KoDF demonstrate that Referee achieves state-of-the-art performance under cross-dataset and cross-language evaluation protocols. These results highlight the importance of cross-modal identity verification for future deepfake detection. The code is available at https://github.com/ewha-mmai/referee.
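To make the matching step concrete, the sketch below illustrates one plausible reading of the abstract: learnable identity-related queries attend to fused audio-visual features of both the reference and the target clip, and the resulting query embeddings are compared to score identity consistency. All module names, dimensions, and the pooling/scoring choices here are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn


class IdentityQueryMatcher(nn.Module):
    """Minimal sketch of reference-aware cross-modal matching.

    Learnable identity queries attend to reference and target
    audio-visual features; the pooled query embeddings are compared
    to produce a real/fake score. Hypothetical, not the authors' code.
    """

    def __init__(self, dim=256, num_queries=8, num_heads=4):
        super().__init__()
        # Identity-related queries shared across clips.
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(dim, 1)

    def extract(self, av_feats):
        # av_feats: (B, T, dim) fused audio-visual features for one clip.
        q = self.queries.unsqueeze(0).expand(av_feats.size(0), -1, -1)
        out, _ = self.attn(q, av_feats, av_feats)  # (B, num_queries, dim)
        return out

    def forward(self, ref_feats, tgt_feats):
        # Pool identity queries from the reference and target clips.
        ref_q = self.extract(ref_feats).mean(dim=1)  # (B, dim)
        tgt_q = self.extract(tgt_feats).mean(dim=1)
        # Score the (mis)match between reference and target identities.
        return self.classifier(ref_q - tgt_q).squeeze(-1)


# Usage with dummy features (batch of 2 clips, 32 time steps):
matcher = IdentityQueryMatcher()
ref = torch.randn(2, 32, 256)
tgt = torch.randn(2, 32, 256)
logits = matcher(ref, tgt)  # (2,) real/fake logits
```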