The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack in practical settings. Our experiments show that models trained on existing datasets suffer severe performance degradation, with average accuracy dropping to 59.6% when evaluated on replayed audio. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both speech synthesized by cutting-edge zero-shot text-to-speech (TTS) systems and physical replay recordings collected with varied devices in real-world environments. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average equal error rates (EERs) in cross-dataset evaluation, indicating better generalization. By introducing practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.