In this paper, we propose a deep-learning framework for Environmental Sound Deepfake Detection (ESDD), the task of identifying whether the sound scene and sound event in an input audio recording are fake or real. To this end, we first conduct extensive experiments to explore how individual spectrograms, a wide range of network architectures, and pre-trained models affect the performance of an ESDD model. The experimental results on the benchmark datasets of EnvSDD indicate that detecting deepfake audio of sound scenes and detecting deepfake audio of sound events should be treated as separate tasks. We also show that fine-tuning a pre-trained model is more effective than training a model from scratch for ESDD. Ultimately, our best model, which fine-tunes the pre-trained BEATs model using the proposed two-phase training strategy, achieves an accuracy of 0.98, an F1 score of 0.95, and an AUC of 0.99 on the Test subset of the EnvSDD dataset. The same model also achieves an accuracy of 0.86, an F1 score of 0.80, and an AUC of 0.93 when evaluated cross-dataset on the ESD-Challenge-TestSet dataset.