Audio Deepfake Detection (ADD) aims to distinguish spoofed speech from bonafide speech. Most prior studies assume that stronger correlations within or across acoustic and emotional features imply authenticity, and thus focus on enhancing or measuring such correlations. However, existing methods often treat acoustic and emotional features in isolation or rely on correlation metrics, which overlook subtle desynchronization between the two and smooth out abrupt discontinuities. To address these issues, we propose EAI-ADD, which treats cross-level emotion-acoustic inconsistency as the primary detection signal. We first project emotional and acoustic representations into a comparable space. Then we progressively integrate frame-level and utterance-level emotion features with acoustic features to capture cross-level emotion-acoustic inconsistencies across different temporal granularities. Experimental results on the ASVspoof 2019 LA and 2021 LA datasets demonstrate that the proposed EAI-ADD outperforms baseline models, providing a more effective solution for audio anti-spoofing.
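The core idea — projecting both feature streams into a comparable space and scoring their mismatch at frame and utterance level — can be illustrated with a minimal numpy sketch. This is not the authors' architecture: the random linear maps `Wa` and `We` stand in for learned projection layers, and `1 - cosine similarity` stands in for whatever inconsistency measure EAI-ADD actually learns; all names here are hypothetical.

```python
import numpy as np

def project(x, W):
    """Project features into a shared comparison space and L2-normalize.
    W is a stand-in for a learned projection layer."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def inconsistency_scores(acoustic, emotion, Wa, We):
    """Toy emotion-acoustic inconsistency at two temporal granularities.

    acoustic: (T, d_a) frame-level acoustic features
    emotion:  (T, d_e) frame-level emotion features
    Returns per-frame scores (T,) and one utterance-level score,
    each computed as 1 - cosine similarity (0 = perfectly aligned).
    """
    za = project(acoustic, Wa)                  # (T, d)
    ze = project(emotion, We)                   # (T, d)
    frame = 1.0 - np.sum(za * ze, axis=-1)      # frame-level desynchronization
    ua, ue = za.mean(axis=0), ze.mean(axis=0)   # mean-pooled utterance embeddings
    utt = 1.0 - float(ua @ ue / (np.linalg.norm(ua) * np.linalg.norm(ue)))
    return frame, utt

rng = np.random.default_rng(0)
T, d_a, d_e, d = 50, 128, 64, 32
Wa = rng.standard_normal((d_a, d))
We = rng.standard_normal((d_e, d))
A = rng.standard_normal((T, d_a))               # placeholder acoustic features
E = rng.standard_normal((T, d_e))               # placeholder emotion features
frame, utt = inconsistency_scores(A, E, Wa, We)
```

In a real detector the frame-level scores would feed a classifier that flags localized desynchronization (abrupt discontinuities), while the utterance-level score captures global mismatch; the sketch only shows the shape of that computation.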