Audio deepfake detection is increasingly important as synthetic speech becomes more realistic and accessible. Recent methods, including those that use graph neural networks (GNNs) to model frequency and temporal dependencies, show strong potential but require large amounts of labeled data, which limits their practical use. Label-efficient alternatives such as graph-based non-contrastive learning offer a potential solution, since they learn useful representations from unlabeled data without negative samples. However, current graph non-contrastive approaches are built for single-view graph representations and cannot be applied directly to audio, which has distinctive spectral and temporal structure. Bridging this gap requires dual-view graph modeling suited to audio signals. In this work, we introduce SIGNL (Spectral-temporal vIsion Graph Non-contrastive Learning), a label-efficient expert system for detecting audio deepfakes. SIGNL operates on visual representations of audio, such as spectrograms or other time-frequency encodings, transforming them into spectral and temporal graphs for structured feature extraction. It then employs graph convolutional encoders to learn complementary frequency-time features, capturing the characteristics unique to audio. These encoders are pre-trained with a non-contrastive self-supervised learning strategy on augmented graph pairs, enabling effective representation learning without labeled data. The resulting encoders are then fine-tuned on minimal labeled data for downstream deepfake detection. SIGNL achieves strong performance on multiple audio deepfake detection benchmarks, including 7.88% EER on ASVspoof 2021 DF and 3.95% EER on ASVspoof 5 using only 5% labeled data. It also generalizes well to unseen conditions, reaching 10.16% EER on the In-The-Wild dataset when trained on CFAD.
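The dual-view graph construction described above can be illustrated with a minimal sketch. The abstract does not specify how edges are formed, so this example assumes a simple k-nearest-neighbour construction over cosine similarity (an assumption, not the paper's exact method): a spectral graph whose nodes are frequency bins and a temporal graph whose nodes are time frames, both derived from the same spectrogram.

```python
import numpy as np

def build_knn_graph(node_feats, k=2):
    """Build a symmetric {0,1} k-NN adjacency matrix from node features,
    using cosine similarity. This is an assumed construction for
    illustration, not the paper's exact graph-building procedure."""
    n = node_feats.shape[0]
    normed = node_feats / (np.linalg.norm(node_feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T
    np.fill_diagonal(sim, -np.inf)          # exclude self-loops
    adj = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]      # k most similar nodes
        adj[i, nbrs] = 1.0
    return np.maximum(adj, adj.T)           # symmetrise

def spectrogram_to_graphs(spec, k=2):
    """Dual-view graph construction from a (freq_bins x time_frames)
    spectrogram: a spectral graph over frequency bins and a temporal
    graph over time frames."""
    spectral_adj = build_knn_graph(spec, k)    # nodes = frequency bins
    temporal_adj = build_knn_graph(spec.T, k)  # nodes = time frames
    return spectral_adj, temporal_adj
```

Each view would then be fed to its own graph convolutional encoder, so that one encoder specialises in frequency-wise structure and the other in temporal structure.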
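The non-contrastive pre-training objective is also easy to sketch. The abstract only states that augmented graph pairs are used without negative samples; the snippet below assumes a BYOL-style loss (negative cosine similarity between an online network's prediction of one augmented view and a stop-gradient target projection of the other), which is one common negative-sample-free objective and may differ from the paper's exact formulation.

```python
import numpy as np

def noncontrastive_loss(online_pred, target_proj):
    """BYOL-style negative-sample-free loss (assumed variant): mean
    negative cosine similarity, rescaled so identical embeddings give 0
    and orthogonal embeddings give 2. `target_proj` is treated as
    stop-gradient during training."""
    p = online_pred / (np.linalg.norm(online_pred, axis=1, keepdims=True) + 1e-8)
    z = target_proj / (np.linalg.norm(target_proj, axis=1, keepdims=True) + 1e-8)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=1)))
```

During pre-training the encoders would minimise this loss over embeddings of the two augmented graph views; fine-tuning then attaches a classification head and trains on the small labeled subset.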