In this paper, we introduce the concept of forensic similarity to the speech deepfake detection domain. Forensic similarity aims to determine whether two audio segments share the same underlying forensic traces. Our approach is inspired by prior work in the image domain. To transfer this idea to audio, we propose a two-stage deep learning framework consisting of a Siamese-based feature extractor and a core decision module, referred to as the similarity network. The system aims to assess whether two speech samples originate from the same source by comparing their forensic characteristics. In practice, the model maps a pair of audio segments to a similarity score indicating whether they contain the same or different forensic traces. We evaluate the proposed method on the emerging task of source verification, demonstrating its ability to determine whether two speech samples were generated by the same model. In addition, we explore its applicability to audio splicing detection as a complementary use case. Experimental results show that the proposed approach generalizes well to previously unseen forensic traces, highlighting its robustness, flexibility, and practical relevance for digital audio forensics.
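To make the data flow of the two-stage framework concrete, the following is a minimal sketch and not the architecture evaluated in this work. It assumes fixed-length segments represented as plain lists of floats, a single shared dense layer as the Siamese feature extractor, and a small two-layer similarity network. The class name `ForensicSimilaritySketch`, all layer sizes, and the random (untrained) weights are hypothetical and chosen only for illustration.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Random weight matrix; a stand-in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def linear(weights, x):
    """Matrix-vector product: one dense layer without bias."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


class ForensicSimilaritySketch:
    """Hypothetical two-stage model: shared extractor + similarity network."""

    def __init__(self, segment_len=16, feat_dim=8, hidden_dim=4, seed=0):
        rng = random.Random(seed)
        # Stage 1: one set of weights, applied to BOTH inputs (Siamese sharing).
        self.w_feat = make_matrix(feat_dim, segment_len, rng)
        # Stage 2: similarity network over the concatenated feature pair.
        self.w_hidden = make_matrix(hidden_dim, 2 * feat_dim, rng)
        self.w_out = make_matrix(1, hidden_dim, rng)

    def extract(self, segment):
        """Map one audio segment to a feature vector."""
        return [math.tanh(v) for v in linear(self.w_feat, segment)]

    def similarity(self, seg_a, seg_b):
        """Map a pair of segments to a similarity score between 0 and 1."""
        feat_a = self.extract(seg_a)
        feat_b = self.extract(seg_b)
        # ReLU hidden layer on the concatenated features.
        hidden = [max(0.0, v) for v in linear(self.w_hidden, feat_a + feat_b)]
        logit = linear(self.w_out, hidden)[0]
        return 1.0 / (1.0 + math.exp(-logit))  # sigmoid


# Usage: score a pair of synthetic segments (random noise, not real speech).
model = ForensicSimilaritySketch()
noise = random.Random(1)
seg_a = [noise.gauss(0.0, 1.0) for _ in range(16)]
seg_b = [noise.gauss(0.0, 1.0) for _ in range(16)]
score = model.similarity(seg_a, seg_b)
print(f"similarity score: {score:.3f}")
```

Because the weights are untrained, the printed score carries no forensic meaning. The sketch only shows that both segments pass through the same extractor before the similarity network produces a single score. In a trained system, a threshold on that score would separate pairs with the same forensic traces from pairs with different ones.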