Current anti-spoofing and audio deepfake detection systems use either magnitude-spectrogram features (such as CQT or mel spectrograms) or raw audio processed through convolutional or sinc layers. Both approaches have drawbacks: magnitude spectrograms discard phase information, which affects audio naturalness, and raw-audio models cannot be used with traditional explainable AI methods. This paper proposes an approach that combines the benefits of both by using complex-valued neural networks to process the complex-valued CQT frequency-domain representation of the input audio. This representation retains phase information and remains compatible with explainable AI methods. Results show that the approach outperforms previous methods on the "In-the-Wild" anti-spoofing dataset and that its decisions can be interpreted with explainable AI. Ablation studies confirm that the model has learned to use phase information to detect voice spoofing.
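The point about phase can be illustrated with a small, self-contained sketch. It is not the paper's model or pipeline: a naive DFT stands in for the CQT, and the helper names (`dft_bin`, `two_tone`), the chosen bins, and the unit weights are hypothetical choices made only for illustration. Two signals that differ only in the phase of one component have identical magnitude spectra, so a magnitude-only unit cannot separate them, whereas a unit that operates on the complex coefficients can.

```python
import cmath
import math


def dft_bin(signal, k):
    """Complex DFT coefficient of `signal` at bin k (naive sum; a stand-in for one CQT bin)."""
    n = len(signal)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(signal))


def two_tone(n, k1, k2, phase2):
    """Sum of two cosines at bins k1 and k2; only the second one is phase-shifted."""
    return [
        math.cos(2 * math.pi * k1 * t / n) + math.cos(2 * math.pi * k2 * t / n + phase2)
        for t in range(n)
    ]


N = 64
BINS = (4, 8)  # hypothetical analysis bins, chosen to match the two tones

# Two waveforms that differ only in the phase of the second tone.
sig_a = two_tone(N, BINS[0], BINS[1], 0.0)
sig_b = two_tone(N, BINS[0], BINS[1], math.pi / 2)

spec_a = [dft_bin(sig_a, k) for k in BINS]
spec_b = [dft_bin(sig_b, k) for k in BINS]

# Magnitude features: identical for both signals, so the phase difference is lost.
mag_a = [abs(c) for c in spec_a]
mag_b = [abs(c) for c in spec_b]

# Phase of each complex coefficient: this is what the magnitude view discards.
phase_a = [cmath.phase(c) for c in spec_a]
phase_b = [cmath.phase(c) for c in spec_b]

# A toy unit with unit weights on both bins (assumed weights, not learned).
weights = (1.0, 1.0)
out_mag_a = sum(w * m for w, m in zip(weights, mag_a))       # magnitude-only input
out_mag_b = sum(w * m for w, m in zip(weights, mag_b))
out_cplx_a = abs(sum(w * z for w, z in zip(weights, spec_a)))  # complex-valued input
out_cplx_b = abs(sum(w * z for w, z in zip(weights, spec_b)))

print("magnitude-only outputs:", out_mag_a, out_mag_b)
print("complex-valued outputs:", out_cplx_a, out_cplx_b)
```

In this toy setting the magnitude-only outputs coincide (both 64 up to rounding), while the complex-valued outputs differ (64 versus roughly 45.25), because the complex sum depends on the relative phase between the two bins.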