Recent progress in audio generation has made it increasingly easy to create highly realistic environmental soundscapes, which can be misused to produce deceptive content, such as fake alarms, gunshots, and crowd sounds, raising concerns for public safety and trust. While deepfake detection for speech and singing voice has been extensively studied, environmental sound deepfake detection (ESDD) remains underexplored. To advance ESDD, the first edition of the ESDD challenge was launched, attracting 97 registered teams and receiving 1,748 valid submissions. This paper presents the task formulation, dataset construction, evaluation protocols, baseline systems, and key insights from the challenge results. Furthermore, we analyze common architectural choices and training strategies among top-performing systems. Finally, we discuss potential future research directions for ESDD, outlining key opportunities and open problems to guide subsequent studies in this field.