With the proliferation of Large Language Model (LLM)-based deepfake audio, there is an urgent need for effective detection methods. Previous deepfake audio generation methods typically involve a multi-step generation process, with the final step using a vocoder to predict the waveform from handcrafted features. However, LLM-based audio is generated end-to-end directly from discrete neural codec representations, skipping the final vocoder step. This poses a significant challenge for current audio deepfake detection (ADD) models that rely on vocoder artifacts. To effectively detect LLM-based deepfake audio, we focus on the core of the generation process: the conversion from neural codec to waveform. We propose the Codecfake dataset, which is generated by seven representative neural codec methods. Experimental results show that codec-trained ADD models exhibit a 41.406% reduction in average equal error rate compared to vocoder-trained ADD models on the Codecfake test set.
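The abstract reports results in terms of equal error rate (EER), the operating point where the false acceptance rate (spoof audio accepted as genuine) equals the false rejection rate (genuine audio rejected). As a minimal sketch of how this metric is computed from detector scores — the function name and the higher-score-means-genuine convention are illustrative assumptions, not taken from the paper:

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Estimate the EER from two lists of detector scores.

    Assumes higher scores indicate "genuine" (bona fide) audio.
    Sweeps candidate thresholds and returns the mean of FAR and FRR
    at the threshold where the two rates are closest.
    """
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # FAR: fraction of spoof utterances scored at or above the threshold
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # FRR: fraction of bona fide utterances scored below the threshold
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Perfectly separated scores give an EER of 0; overlapping scores do not.
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # → 0.0
```

In practice, toolkits interpolate between thresholds for a smoother estimate, but this discrete sweep captures the definition behind the reported 41.406% average EER reduction.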