Audio deepfake detection (ADD) in real-world scenarios has evolved from speech-only spoofing to more challenging component-level settings, in which speech and environmental sounds may be manipulated independently. To address this, we propose EnvTriCascade, an Environment-Aware Tri-Stage Cascaded framework for the ESDD2 Challenge. First, a mix-consistency detector provides a binary prior that distinguishes original recordings from manipulated mixtures and is used to calibrate the final decisions. Next, two complementary five-class detectors, built on SSLAM+XLS-R and EAT-large+XLS-R representations respectively, extract robust multi-branch features, which are integrated via a cross-branch attention-gated classifier. To improve robustness to diverse mixing conditions, we incorporate RawBoost augmentation. Trained exclusively on the official CompSpoofV2 dataset, our system achieves a Macro-F1 score of 0.8266 on the test set, significantly outperforming the official baseline and ranking second in the challenge.
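For readers unfamiliar with the reported metric, the following is a minimal sketch of how a Macro-F1 score over five classes can be computed: the per-class F1 scores are averaged with equal weight, so rare classes count as much as frequent ones. The function name `macro_f1`, the integer class encoding 0 to 4, and the convention of scoring a class with no true or predicted samples as 0 are assumptions made for illustration; the challenge's official scoring script is not described here and may differ in such details.

```python
from collections import Counter


def macro_f1(y_true, y_pred, num_classes=5):
    """Unweighted mean of per-class F1 scores (illustrative sketch).

    Labels are assumed to be integers in [0, num_classes).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the truth but was missed

    f1_scores = []
    for c in range(num_classes):
        denom = 2 * tp[c] + fp[c] + fn[c]
        # Assumption: a class that never appears in y_true or y_pred scores 0.
        f1_scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


if __name__ == "__main__":
    # Toy labels, not challenge data.
    truth = [0, 1, 2, 3, 4, 4]
    preds = [0, 1, 2, 3, 4, 0]
    print(f"Macro-F1: {macro_f1(truth, preds):.4f}")
```

Because every class contributes equally to the average, a system that does well only on the most common class cannot obtain a high Macro-F1.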