The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five-class component-level audio spoofing detection, in which speech and environmental sounds may be manipulated independently or jointly. With the challenge concluded, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than from simple model scaling. At the same time, auxiliary equal error rate (EER) analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to generators that appear only in the test set. This paper reports the challenge results and provides insights for future research on environment-aware deepfake detection. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.