Speech deepfake detection (DFD) has benefited from diverse acoustic and semantic speech representations, many of which encode valuable speech information and are costly to train. Existing approaches typically enhance DFD by fine-tuning these representations or applying post-hoc classification on frozen features, offering little control over strengthening discriminative deepfake cues without distorting the original semantics. We find that emotion is encoded across diverse speech features and correlates with DFD. We therefore introduce a unified, feature-agnostic, and non-destructive training framework that uses emotion as a bridging constraint to guide speech features toward DFD, treating emotion recognition as a representation alignment objective rather than an auxiliary task, while preserving the original semantic information. Experiments on FakeOrReal and IntheWild show accuracy improvements of up to 6\% and 2\%, respectively, with corresponding reductions in equal error rate. Code is provided in the supplementary material.
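To make the idea of a feature-agnostic, non-destructive alignment objective concrete, here is a minimal sketch (not the authors' code) of one possible reading: a frozen speech representation passes through a lightweight trainable adapter, the DFD loss is combined with an emotion-based alignment term, and a preservation term keeps the adapted features close to the originals. All module names, loss forms, and weights (`EmotionBridgedDFD`, `lam_align`, `lam_keep`, the KL-based alignment to soft emotion posteriors) are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmotionBridgedDFD(nn.Module):
    """Hypothetical sketch: emotion as a bridging constraint on frozen speech features."""
    def __init__(self, feat_dim: int, num_emotions: int = 7):
        super().__init__()
        # lightweight trainable adapter on top of frozen features
        self.adapter = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.GELU(), nn.Linear(feat_dim, feat_dim)
        )
        self.dfd_head = nn.Linear(feat_dim, 2)             # real vs. fake
        self.emo_proj = nn.Linear(feat_dim, num_emotions)  # projection into emotion space

    def forward(self, frozen_feats, emo_anchor, dfd_labels,
                lam_align: float = 0.5, lam_keep: float = 0.1):
        z = self.adapter(frozen_feats)
        # main deepfake-detection objective
        loss_dfd = F.cross_entropy(self.dfd_head(z), dfd_labels)
        # alignment term: pull adapted features toward emotion posteriors
        # (emo_anchor could come from a pretrained emotion recognizer)
        loss_align = F.kl_div(
            F.log_softmax(self.emo_proj(z), dim=-1), emo_anchor, reduction="batchmean"
        )
        # non-destructive constraint: stay close to the original representation
        loss_keep = F.mse_loss(z, frozen_feats)
        return loss_dfd + lam_align * loss_align + lam_keep * loss_keep
```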