Audio deepfake detectors often fail to generalize across speakers because they learn speaker-identity features rather than synthesis artifacts, a failure mode known as implicit identity leakage. Existing methods mitigate this leakage but add architectural complexity or suffer from training instability. This paper proposes a dual-granularity orthogonal disentanglement framework that enforces feature independence at two levels: a sample-level cosine orthogonality term decorrelates the directions of paired embeddings, while a batch-level cross-covariance regularizer removes linear correlations across embedding dimensions. A curriculum disentanglement schedule progressively strengthens the orthogonality constraint and requires neither auxiliary networks nor adversarial training. Experiments on the ASVspoof 2019 LA, ASVspoof 2021 DF, and In-the-Wild datasets show that the proposed method achieves equal error rates (EER) of 1.35%, 7.88%, and 21.58%, respectively, surpassing gradient reversal disentanglement by 2.60% absolute on cross-dataset transfer.
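To make the two granularities concrete, the following is a minimal NumPy sketch of how such constraints could be written. The abstract does not give the exact loss definitions, so everything here is an assumption for illustration: the function names, the use of squared cosine similarity, the squared Frobenius norm of the cross-covariance matrix, and the linear ramp used for the curriculum weight are all hypothetical choices, not the authors' formulation. The sketch also assumes the artifact and speaker embeddings share the same dimension so that a per-sample cosine can be computed.

```python
import numpy as np

def cosine_orthogonality_loss(z_art, z_spk, eps=1e-8):
    """Sample-level term (assumed form): mean squared cosine similarity
    between each sample's artifact and speaker embeddings."""
    dot = np.sum(z_art * z_spk, axis=1)
    norms = np.linalg.norm(z_art, axis=1) * np.linalg.norm(z_spk, axis=1) + eps
    return float(np.mean((dot / norms) ** 2))

def cross_covariance_loss(z_art, z_spk):
    """Batch-level term (assumed form): squared Frobenius norm of the
    cross-covariance matrix between the two embedding sets."""
    a = z_art - z_art.mean(axis=0, keepdims=True)
    b = z_spk - z_spk.mean(axis=0, keepdims=True)
    n = z_art.shape[0]
    cov = a.T @ b / max(n - 1, 1)
    return float(np.sum(cov ** 2))

def curriculum_weight(step, total_steps, lam_max=1.0):
    """Hypothetical schedule: linear ramp from 0 to lam_max, then constant."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return lam_max * frac

def disentanglement_loss(z_art, z_spk, step, total_steps, lam_max=1.0):
    """Weighted sum of both terms; the weight grows as training proceeds."""
    lam = curriculum_weight(step, total_steps, lam_max)
    return lam * (cosine_orthogonality_loss(z_art, z_spk)
                  + cross_covariance_loss(z_art, z_spk))

# Toy batch: every pair is orthogonal per sample, yet the batch is
# linearly correlated across dimensions.
z_art = np.array([[1.0, 0.0], [0.0, 1.0]])
z_spk = np.array([[0.0, 1.0], [1.0, 0.0]])
print(cosine_orthogonality_loss(z_art, z_spk))  # sample-level term
print(cross_covariance_loss(z_art, z_spk))      # batch-level term
```

In the toy batch, the sample-level term vanishes while the batch-level term does not, which illustrates why the two constraints are complementary: per-sample orthogonality alone does not rule out linear dependence across the batch.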