Audio deepfakes generated by modern TTS and voice conversion systems are increasingly difficult to distinguish from real speech, raising serious risks for security and online trust. While state-of-the-art self-supervised models provide rich multi-layer representations, existing detectors weight layers independently and overlook the temporal and hierarchical dependencies critical for identifying synthetic artefacts. We propose HierCon, a hierarchical layer attention framework combined with margin-based contrastive learning that models dependencies across temporal frames, neighbouring layers, and layer groups, while encouraging domain-invariant embeddings. Evaluated on the ASVspoof 2021 DF and In-the-Wild datasets, our method achieves state-of-the-art performance (EERs of 1.93% and 6.87%), improving over independent layer weighting by 36.6% and 22.5%, respectively. The results and attention visualisations confirm that hierarchical modelling enhances generalisation across domains, generation techniques, and recording conditions.
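The hierarchical attention and margin-based contrastive objective summarised above can be sketched as follows. This is a minimal illustrative sketch, not the paper's exact formulation: the attention parameterisation (simple learned query vectors), the layer grouping, and all shapes are assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hierarchical_pool(layers, frame_q, layer_q, group_q, groups):
    """Three-level attention pooling over SSL hidden states.

    layers:  (L, T, D) hidden states from L transformer layers.
    frame_q, layer_q, group_q: (D,) learned query vectors (assumed form).
    groups:  list of layer-index lists, e.g. [[0, 1], [2, 3]].
    Returns a single (D,) utterance embedding.
    """
    # Level 1 -- temporal: attend over frames within each layer -> (L, D)
    a_t = softmax(np.einsum('ltd,d->lt', layers, frame_q), axis=1)
    layer_emb = np.einsum('lt,ltd->ld', a_t, layers)
    # Level 2 -- neighbouring layers: attend within each layer group -> (G, D)
    group_embs = []
    for g in groups:
        a_l = softmax(layer_emb[g] @ layer_q)
        group_embs.append(a_l @ layer_emb[g])
    group_embs = np.stack(group_embs)
    # Level 3 -- layer groups: attend over group embeddings -> (D,)
    a_g = softmax(group_embs @ group_q)
    return a_g @ group_embs

def margin_contrastive(e1, e2, same_class, margin=1.0):
    """Classic margin-based contrastive loss on a pair of embeddings:
    pulls same-class pairs together, pushes different-class pairs
    apart up to the margin (encouraging domain-invariant clusters)."""
    d = np.linalg.norm(e1 - e2)
    return d ** 2 if same_class else max(0.0, margin - d) ** 2
```

In this sketch, frame-level attention first pools each layer over time, layer-level attention pools neighbouring layers within a group, and group-level attention pools the groups into one embedding on which the contrastive margin loss operates.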