Recent advances in speech synthesis, including text-to-speech (TTS) and voice conversion (VC) systems, enable the generation of ultra-realistic audio deepfakes, raising growing concern about their potential misuse. However, most deepfake (DF) detection methods rely solely on the fuzzy knowledge learned by a single model, resulting in performance bottlenecks and transparency issues. Inspired by retrieval-augmented generation (RAG), we propose a retrieval-augmented detection (RAD) framework that augments test samples with similar retrieved samples for enhanced detection. We also extend the multi-fusion attentive classifier to integrate it with our proposed RAD framework. Extensive experiments show that the proposed RAD framework outperforms baseline methods, achieving state-of-the-art results on the ASVspoof 2021 DF set and competitive results on the ASVspoof 2019 and 2021 LA sets. Further sample analysis indicates that the retriever mostly retrieves samples from the same speaker as the query audio, with highly consistent acoustic characteristics, thereby improving detection performance.
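The retrieval step at the core of RAD can be illustrated with a minimal sketch. It assumes each utterance has already been mapped to a fixed-size embedding vector and that similarity is measured by cosine similarity; the abstract does not specify the feature extractor, the similarity measure, or how the retrieved samples are fused, so the function names `retrieve_top_k` and `augment`, the parameter `k`, and the stacking step are all hypothetical and stand in for the actual retriever and multi-fusion attentive classifier.

```python
import numpy as np


def retrieve_top_k(query, database, k=3):
    """Return indices and cosine similarities of the k database rows most similar to query.

    Assumption: cosine similarity over fixed-size embeddings; the paper's
    actual retriever may differ.
    """
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = db @ q                      # cosine similarity of every row with the query
    top = np.argsort(-sims)[:k]        # indices sorted by descending similarity
    return top, sims[top]


def augment(query, database, k=3):
    """Stack the query embedding with its k retrieved neighbours.

    Assumption: a downstream classifier consumes the (k + 1, dim) array;
    this stacking is only a placeholder for the paper's fusion mechanism.
    """
    idx, _ = retrieve_top_k(query, database, k)
    return np.vstack([query[None, :], database[idx]])


if __name__ == "__main__":
    # Toy 2-D embeddings, purely illustrative.
    database = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    query = np.array([1.0, 0.1])
    indices, scores = retrieve_top_k(query, database, k=2)
    print(indices, scores)
    print(augment(query, database, k=2).shape)
```

In a real system the database would hold embeddings of labelled bona fide and spoofed utterances, and the classifier would attend over the query and its neighbours instead of receiving a plain stack.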