Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generative models, either component can now be modified independently. Such component-level manipulations are harder to detect: the remaining unaltered component can mislead systems designed to detect fully spoofed audio, and the manipulated recordings often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset for component-level audio anti-spoofing, containing over 250k audio samples with a total duration of approximately 283 hours. Building on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on component-level spoofing: both the speech and the environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).
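To make the notion of component-level manipulation concrete, the sketch below mixes a foreground speech signal with a background signal at a chosen signal-to-noise ratio. A component-level spoof is assembled the same way, except that one component (the speech or the background) is first replaced by synthesized audio. This is a minimal illustration under assumed conventions, not the actual CompSpoofV2 construction pipeline; the function names are hypothetical.

```python
import math

def rms(x):
    """Root-mean-square amplitude of a signal (list of samples)."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def mix_at_snr(speech, background, snr_db):
    """Scale `background` so the speech-to-background power ratio
    equals `snr_db` (in dB), then sum the two components sample-wise.

    A component-level spoof keeps one component bona fide (e.g. the
    recorded speech) and swaps the other for generated audio before
    this mixing step.
    """
    gain = rms(speech) / (rms(background) * 10 ** (snr_db / 20))
    return [s + gain * b for s, b in zip(speech, background)]
```

Because only one component is altered, a detector trained on fully spoofed utterances sees a signal that is largely genuine, which is the failure mode the challenge targets.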