Audio deepfakes pose a growing threat, already exploited in fraud and misinformation. A key challenge is ensuring that detectors remain robust to unseen synthesis methods and diverse speakers, since generation techniques evolve quickly. Despite strong benchmark results, current systems struggle to generalize to new conditions, limiting real-world reliability. To address this, we introduce TWINSHIFT, a benchmark explicitly designed to evaluate detection robustness under strictly unseen conditions. Our benchmark is constructed from six different synthesis systems, each paired with disjoint sets of speakers, allowing a rigorous assessment of how well detectors generalize when both the generative model and the speaker identity change. Through extensive experiments, we show that TWINSHIFT reveals important robustness gaps, uncovers overlooked limitations, and provides principled guidance for developing audio deepfake detection (ADD) systems. The TWINSHIFT benchmark can be accessed at https://github.com/intheMeantime/TWINSHIFT.
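The core evaluation idea, training on some (system, speaker) conditions and testing only on systems and speakers never seen in training, can be sketched as follows. This is a minimal illustrative example, not the TWINSHIFT code; the system and speaker names are placeholders invented for the sketch.

```python
# Hypothetical sketch of a strictly-unseen (system, speaker) split.
# Names are illustrative only, not the TWINSHIFT API or data layout.
SYSTEMS = [f"sys{i}" for i in range(6)]           # six synthesis systems
SPEAKERS = {s: [f"{s}_spk{j}" for j in range(4)]  # disjoint speaker pools,
            for s in SYSTEMS}                     # one pool per system

def strictly_unseen_split(train_systems):
    """Train on some systems; test only on systems AND speakers
    absent from training (both shift at once)."""
    test_systems = [s for s in SYSTEMS if s not in train_systems]
    train = [(s, spk) for s in train_systems for spk in SPEAKERS[s]]
    test = [(s, spk) for s in test_systems for spk in SPEAKERS[s]]
    # sanity check: no overlap in systems or speakers across the split
    assert not {s for s, _ in train} & {s for s, _ in test}
    assert not {spk for _, spk in train} & {spk for _, spk in test}
    return train, test

train, test = strictly_unseen_split(SYSTEMS[:4])
```

Because speaker pools are tied to systems and kept disjoint, a detector evaluated on the test conditions faces both an unseen generator and an unseen speaker identity simultaneously.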