Existing deepfake speech detection systems generalize poorly to unseen attacks (i.e., samples produced by generative algorithms not seen during training). Recent studies have explored universal speech representations to tackle this issue and have obtained encouraging results. These works, however, have focused on innovating downstream classifiers while leaving the representation itself untouched. In this study, we argue that characterizing the long-term temporal dynamics of these representations is crucial for generalizability, and we propose a new method to assess representation dynamics. Using the proposed method, we show that different generative models produce similar patterns of representation dynamics. Experiments on the ASVspoof 2019 and 2021 datasets validate the benefits of the proposed method for detecting deepfakes produced by methods unseen during training, with significant improvements over several benchmark methods.