The growing prevalence of speech deepfakes has raised serious concerns, particularly in real-world scenarios such as telephone fraud and identity theft. While many anti-spoofing systems have demonstrated promising performance on lab-generated synthetic speech, they often fail when confronted with physical replay attacks, a common and low-cost form of attack in practical settings. Our experiments show that models trained on existing datasets degrade severely on replayed audio, with average accuracy dropping to 59.6%. To bridge this gap, we present EchoFake, a comprehensive dataset comprising more than 120 hours of audio from over 13,000 speakers, featuring both speech generated by cutting-edge zero-shot text-to-speech (TTS) systems and physical replay recordings collected with varied devices in real-world environments. Additionally, we evaluate three baseline detection models and show that models trained on EchoFake achieve lower average equal error rates (EERs) across datasets, indicating better generalization. By introducing practical challenges relevant to real-world deployment, EchoFake offers a more realistic foundation for advancing spoofing detection methods.