Partial audio deepfakes, where synthesized segments are spliced into genuine recordings, are particularly deceptive because most of the audio remains authentic. Existing detectors are supervised: they require frame-level annotations, overfit to specific synthesis pipelines, and must be retrained as new generative models emerge. We argue that this supervision is unnecessary. We hypothesize that speech foundation models implicitly encode a forensic signal: genuine speech forms smooth, slowly varying embedding trajectories, while splice boundaries introduce abrupt disruptions in frame-level transitions. Building on this hypothesis, we propose TRACE (Training-free Representation-based Audio Countermeasure via Embedding dynamics), a framework that detects partial audio deepfakes by analyzing the first-order dynamics of frozen speech foundation model representations, without any training, labeled data, or architectural modification. We evaluate TRACE on four benchmarks spanning two languages, using six speech foundation models. On PartialSpoof, TRACE achieves an equal error rate (EER) of 8.08%, competitive with fine-tuned supervised baselines. On LlamaPartialSpoof, the most challenging benchmark, which features LLM-driven commercial synthesis, TRACE surpasses a supervised baseline outright (24.12% vs. 24.49% EER) without any target-domain data. These results show that temporal dynamics in speech foundation models provide an effective, generalizable signal for training-free audio forensics.
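To make the core hypothesis concrete, the following is a minimal sketch of what "first-order dynamics" of frame-level embeddings could look like in practice. It is not the TRACE scoring function, which the abstract does not specify. The cosine-distance transition measure, the robust z-score, and the names `first_order_dynamics`, `utterance_score`, and `make_trajectory` are all assumptions chosen for illustration, and synthetic random-walk trajectories stand in for embeddings from a frozen speech foundation model.

```python
import numpy as np


def first_order_dynamics(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between consecutive frame embeddings.

    embeddings has shape (T, D); the result has shape (T - 1,).
    Assumption: cosine distance is one plausible transition measure,
    not necessarily the one used by TRACE.
    """
    e = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(e, axis=1, keepdims=True)
    e = e / np.clip(norms, 1e-12, None)  # unit-normalize each frame
    cosine = np.sum(e[:-1] * e[1:], axis=1)
    return 1.0 - cosine


def utterance_score(embeddings: np.ndarray) -> float:
    """Largest transition, expressed as a robust z-score.

    Assumption: a median/MAD-normalized maximum is used here only to
    show how an abrupt jump stands out from smooth transitions.
    """
    d = first_order_dynamics(embeddings)
    median = np.median(d)
    mad = np.median(np.abs(d - median)) + 1e-12
    return float(np.max((d - median) / mad))


def make_trajectory(num_frames, dim, rng, step=0.05):
    """Hypothetical stand-in for real embeddings: a smooth random walk."""
    start = rng.normal(size=dim)
    steps = rng.normal(scale=step, size=(num_frames, dim))
    return start + np.cumsum(steps, axis=0)


rng = np.random.default_rng(0)

# "Genuine" utterance: one smooth trajectory of 200 frames.
genuine = make_trajectory(200, 64, rng)

# "Spliced" utterance: two unrelated trajectories joined at frame 100.
spliced = np.concatenate(
    [make_trajectory(100, 64, rng), make_trajectory(100, 64, rng)]
)

genuine_score = utterance_score(genuine)
spliced_score = utterance_score(spliced)

# Index of the largest transition; index i is the step from frame i to i + 1.
splice_frame = int(np.argmax(first_order_dynamics(spliced)))
```

In this toy setting the spliced trajectory receives a much larger score than the smooth one, and the largest transition falls at the join. With real audio, the `embeddings` array would instead come from a hidden layer of a frozen speech foundation model, and the choice of layer, distance, and pooling statistic would need to be validated empirically.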