Recent advances in AI-generated voices have intensified the challenge of detecting deepfake audio, posing risks for scams and the spread of disinformation. To tackle this issue, we establish the largest public voice dataset to date, named DeepFakeVox-HQ, comprising 1.3 million samples, including 270,000 high-quality deepfake samples from 14 diverse sources. Despite previously reported high accuracy, existing deepfake voice detectors struggle with our diversely collected dataset, and their detection success rates drop even further under realistic corruptions and adversarial attacks. We conduct a holistic investigation into the factors that enhance model robustness and show that incorporating a diversified set of voice augmentations is beneficial. Moreover, we find that the best detection models often rely on high-frequency features, which are imperceptible to humans and can be easily manipulated by an attacker. To address this, we propose F-SAT, a Frequency-Selective Adversarial Training method that focuses on high-frequency components. Empirical results demonstrate that training on our dataset boosts baseline model performance (without robust training) by 33%, and our robust training further improves accuracy by 7.7% on clean samples and by 29.3% on corrupted and attacked samples, over the state-of-the-art RawNet3 model.