With the advancement of audio generation, generative models can now produce highly realistic audio. However, the proliferation of deepfake general audio poses serious risks. We therefore propose a new task, deepfake general audio detection, which aims to determine whether audio content has been manipulated and to localize the manipulated regions. Leveraging an automated manipulation pipeline, we construct FakeSound, a dataset for deepfake general audio detection; samples can be viewed at https://FakeSoundData.github.io. Human listeners achieve an average binary accuracy consistently below 0.6 across all test sets, which indicates how difficult deepfake general audio is to discern and affirms the efficacy of the FakeSound dataset. As a benchmark system, we propose a deepfake detection model built on a general audio pre-trained model. Experimental results demonstrate that the proposed model outperforms both state-of-the-art deepfake speech detection systems and human testers.