Because generative models evolve rapidly, robust generalization to unseen attacks remains a challenge in Audio Deepfake Detection (ADD). To address this, we propose a framework centered on hard-sample classification. The core idea is that a model able to distinguish hard samples should also handle easier cases. We investigate multiple reconstruction paradigms and find the diffusion-based method to be the most effective for generating hard samples. Furthermore, we leverage multi-layer feature aggregation and introduce a Regularization-Assisted Contrastive Learning (RACL) objective to improve generalization. Experiments show that our approach generalizes better than the baseline, with our best model achieving a significantly lower average Equal Error Rate (EER).
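For reference, the EER used as the evaluation metric is the operating point at which the false acceptance rate (spoofed audio accepted as bona fide) equals the false rejection rate (bona fide audio rejected). The following is a minimal sketch of that computation, not the authors' evaluation code: the function name `compute_eer`, the convention that higher scores indicate bona fide speech, and the midpoint approximation at the closest threshold are all assumptions made for illustration.

```python
import numpy as np


def compute_eer(bonafide_scores, spoof_scores):
    """Approximate the Equal Error Rate from two sets of detection scores.

    Assumption: a higher score means "more likely bona fide".
    Returns (eer, threshold) at the threshold where FAR and FRR are closest.
    """
    bonafide = np.asarray(bonafide_scores, dtype=float)
    spoof = np.asarray(spoof_scores, dtype=float)

    # Candidate thresholds: every distinct observed score, in ascending order.
    thresholds = np.unique(np.concatenate([bonafide, spoof]))

    # FRR: fraction of bona fide samples scored below the threshold (rejected).
    frr = np.array([(bonafide < t).mean() for t in thresholds])
    # FAR: fraction of spoofed samples scored at or above the threshold (accepted).
    far = np.array([(spoof >= t).mean() for t in thresholds])

    # Pick the threshold where the two error rates are closest and
    # report their midpoint as the EER estimate.
    idx = int(np.argmin(np.abs(far - frr)))
    eer = (far[idx] + frr[idx]) / 2.0
    return float(eer), float(thresholds[idx])


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    eer, thr = compute_eer([0.9, 0.6, 0.4, 0.8], [0.1, 0.5, 0.7, 0.2])
    print(f"EER = {eer:.2%} at threshold {thr}")
```

An average EER of the kind reported here would then be the mean of this value over the individual evaluation sets, assuming each set is scored separately.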