Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. Using codec-resynthesized speech (CoRS) as proxy training data improves detection performance, but models trained this way often generalize poorly. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce the Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval dataset (from CodecFake+) that features 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks on both CoSG Eval and CoSG ExtEval.
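The idea of turning deterministic feature statistics into stochastic distributions can be illustrated with a small sketch. The abstract does not specify how DSFA computes or samples these statistics, so everything below is an assumption made for illustration: the function name `perturb_feature_statistics`, the `(batch, channels, time)` feature layout, the use of per-utterance channel mean and standard deviation as the "feature statistics", and Gaussian sampling scaled by the batch-level spread of those statistics. It should be read as one plausible instance of feature-statistics perturbation, not as the authors' implementation.

```python
import numpy as np


def perturb_feature_statistics(x, rng, eps=1e-6, training=True):
    """Resample per-utterance channel statistics (hypothetical sketch).

    x: array of shape (batch, channels, time), e.g. SSL backbone features.
    Returns an array of the same shape. At inference time (training=False)
    the input is returned unchanged.
    """
    if not training:
        return x

    # Deterministic statistics of each utterance, per channel: shape (B, C, 1).
    mu = x.mean(axis=2, keepdims=True)
    sigma = np.sqrt(x.var(axis=2, keepdims=True) + eps)

    # Assumed uncertainty estimate: how much the statistics vary across
    # the batch, per channel: shape (1, C, 1).
    mu_spread = np.sqrt(mu.var(axis=0, keepdims=True) + eps)
    sigma_spread = np.sqrt(sigma.var(axis=0, keepdims=True) + eps)

    # Treat each statistic as a Gaussian centred on its observed value
    # and draw a new mean and standard deviation for every utterance.
    new_mu = mu + rng.standard_normal(mu.shape) * mu_spread
    new_sigma = sigma + rng.standard_normal(sigma.shape) * sigma_spread

    # Normalize with the original statistics, then re-style with the samples.
    return new_sigma * (x - mu) / sigma + new_mu


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    features = rng.standard_normal((4, 3, 50))  # toy stand-in for SSL features
    augmented = perturb_feature_statistics(features, rng)
    print(augmented.shape)
```

In this sketch the perturbation is applied only during fine-tuning, matching the abstract's description, and the normalized content of each utterance is left intact while its channel-wise mean and scale are shifted.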