Audio recorded in real-world environments often contains a mixture of foreground speech and background environmental sounds. With rapid advances in text-to-speech, voice conversion, and other generative models, either component can now be modified independently. Such component-level manipulations are harder to detect: the remaining unaltered component can mislead systems designed to detect fully spoofed audio, and the results often sound more natural to human listeners. To address this gap, we propose the CompSpoofV2 dataset and a separation-enhanced joint learning framework. CompSpoofV2 is a large-scale curated dataset designed for component-level audio anti-spoofing, containing over 250k audio samples with a total duration of approximately 283 hours. Building on CompSpoofV2 and the separation-enhanced joint learning framework, we launch the Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), which focuses on component-level spoofing, where both speech and environmental sounds may be manipulated or synthesized, creating a more challenging and realistic detection scenario. The challenge will be held in conjunction with the IEEE International Conference on Multimedia and Expo 2026 (ICME 2026).