Deepfakes (manipulated or forged audio and video media) pose significant security risks to individuals, organizations, and society at large. Machine learning-based classifiers are commonly used to detect such content. In this paper, we assess the robustness of these classifiers with a systematic penetration testing methodology, which we call DeePen. Our approach requires no prior knowledge of, or access to, the target deepfake detection models. Instead, it applies a set of carefully selected signal processing modifications, referred to as attacks, to evaluate model vulnerabilities. Using DeePen, we analyze both real-world production systems and publicly available academic model checkpoints, and show that all tested systems exhibit weaknesses and can be reliably deceived by simple manipulations such as time-stretching or echo addition. Furthermore, we find that some attacks can be mitigated by retraining detection systems with knowledge of the specific attack, whereas others remain persistently effective.
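To make the two manipulations named above concrete, the following is a minimal sketch of echo addition and time-stretching using only NumPy. It is an illustration under stated assumptions, not the DeePen implementation: the function names `add_echo` and `naive_time_stretch`, the delay, decay, and rate values, and the synthetic test tone are all hypothetical choices. The stretch shown here is plain resampling, which also shifts pitch; a pitch-preserving stretch (for example, a phase vocoder) would be a different technique.

```python
import numpy as np


def add_echo(signal, sample_rate, delay_s=0.25, decay=0.4):
    """Mix one delayed, attenuated copy of the signal into itself.

    delay_s and decay are illustrative defaults, not values from the paper.
    """
    delay = int(round(delay_s * sample_rate))
    out = np.zeros(len(signal) + delay, dtype=np.float64)
    out[:len(signal)] += signal        # original signal
    out[delay:] += decay * signal      # delayed, quieter copy
    return out


def naive_time_stretch(signal, rate):
    """Change duration by linear-interpolation resampling.

    rate > 1 shortens the clip, rate < 1 lengthens it.
    Assumption: this simple variant also changes pitch.
    """
    n_out = int(round(len(signal) / rate))
    src_positions = np.linspace(0, len(signal) - 1, num=n_out)
    return np.interp(src_positions, np.arange(len(signal)), signal)


# Synthetic one-second tone as a stand-in for a speech recording.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)

echoed = add_echo(tone, sample_rate)
stretched = naive_time_stretch(tone, rate=1.1)
```

In a black-box evaluation of the kind described, each modified clip would be submitted to the detector and its score compared with the score for the unmodified clip; the detector interface itself is outside the scope of this sketch.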