AI-assisted Tagging of Deepfake Audio Calls using Challenge-Response

Scammers are aggressively leveraging AI voice-cloning technology for social engineering attacks, a situation significantly worsened by the advent of audio Real-time Deepfakes (RTDFs). RTDFs can clone a target's voice in real time over phone calls, making the calls highly interactive and thus far more convincing. Our research addresses a gap in the existing literature on deepfake detection, which has largely been ineffective against RTDF threats. We introduce a robust challenge-response-based method to detect deepfake audio calls and pioneer a comprehensive taxonomy of audio challenges. Our evaluation pits 20 prospective challenges against a leading voice-cloning system. We have compiled a novel open-source challenge dataset with contributions from 100 smartphone and desktop users, yielding 18,600 original and 1.6 million deepfake samples. Through rigorous machine and human evaluations of this dataset, we achieved a deepfake detection rate of 86% and an AUC score of 80%, respectively. Notably, using a set of 11 challenges significantly enhances detection capabilities. Our findings reveal that combining human intuition with machine precision offers complementary advantages. Consequently, we have developed a human-AI collaborative system that melds human discernment with algorithmic accuracy, boosting final joint accuracy to 82.9%. This system highlights the significant advantage of AI-assisted pre-screening in call verification processes. Samples can be heard at https://mittalgovind.github.io/autch-samples/
