BreathNet: Generalizable Audio Deepfake Detection via Breath-Cue-Guided Feature Refinement

As deepfake audio becomes more realistic and diverse, developing generalizable countermeasure systems has become crucial. Existing detection methods rely primarily on XLS-R front-end features to improve generalization. Nonetheless, their performance remains limited, partly because they pay insufficient attention to fine-grained information such as physiological cues or frequency-domain features. In this paper, we propose BreathNet, a novel audio deepfake detection framework that integrates fine-grained breath information to improve generalization. Specifically, we design BreathFiLM, a feature-wise linear modulation mechanism that selectively amplifies temporal representations based on the presence of breathing sounds. BreathFiLM is trained jointly with the XLS-R extractor, which in turn encourages the extractor to learn breath-related cues and encode them into the temporal features. We then use a frequency front-end to extract spectral features, which are fused with the temporal features to provide complementary information about artifacts introduced by vocoders or compression. Additionally, we propose a group of feature losses comprising Positive-only Supervised Contrastive Loss (PSCL), center loss, and contrast loss. These losses jointly enhance discriminative ability, encouraging the model to separate bona fide and deepfake samples more effectively in the feature space. Extensive experiments on five benchmark datasets demonstrate state-of-the-art (SOTA) performance. When trained on the ASVspoof 2019 LA training set, our method attains an average equal error rate (EER) of 1.99% across four related evaluation benchmarks, with particularly strong performance on the In-the-Wild dataset, where it achieves 4.70% EER. Moreover, under the ASVspoof5 evaluation protocol, our method achieves an EER of 4.94% on this latest benchmark.
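
To make the idea of feature-wise linear modulation concrete, the following is a minimal sketch of a FiLM-style operation conditioned on a breath cue. It is not the BreathFiLM implementation: the abstract does not specify how the breath cue is obtained or how the scale and shift are parameterized, so the per-frame breath probability `breath_prob`, the per-channel weights `w_gamma` and `w_beta`, and the function name `film_modulate` are all assumptions made for illustration. Plain Python lists are used only to keep the sketch self-contained; a practical system would presumably use a tensor library.

```python
from typing import List, Sequence


def film_modulate(
    features: Sequence[Sequence[float]],
    breath_prob: Sequence[float],
    w_gamma: Sequence[float],
    w_beta: Sequence[float],
) -> List[List[float]]:
    """Apply a FiLM-style scale and shift driven by a per-frame breath cue.

    features:    T frames, each with C channels (stand-in for temporal features).
    breath_prob: T values in [0, 1]; an assumed per-frame breath likelihood.
    w_gamma:     C hypothetical learned weights mapping the cue to a scale.
    w_beta:      C hypothetical learned weights mapping the cue to a shift.
    """
    if len(features) != len(breath_prob):
        raise ValueError("features and breath_prob must have the same length")

    modulated = []
    for frame, p in zip(features, breath_prob):
        if len(frame) != len(w_gamma) or len(frame) != len(w_beta):
            raise ValueError("each frame must have one value per weight")
        # gamma = 1 + w_gamma * p and beta = w_beta * p, so a frame with no
        # breath evidence (p = 0) passes through unchanged (an assumed design).
        modulated.append(
            [(1.0 + wg * p) * x + wb * p for x, wg, wb in zip(frame, w_gamma, w_beta)]
        )
    return modulated


if __name__ == "__main__":
    feats = [[1.0, 2.0], [1.0, 2.0]]   # two frames, two channels
    probs = [0.0, 1.0]                 # no breath, then clear breath
    print(film_modulate(feats, probs, w_gamma=[0.5, 1.0], w_beta=[0.0, -1.0]))
```

In this sketch the first frame is left untouched because its breath probability is zero, while the second frame is rescaled and shifted per channel. In a jointly trained system, gradients flowing through the scale and shift would be one way for the breath cue to influence the upstream feature extractor.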
