Partially spoofed audio detection is challenging because it requires judging the authenticity of audio at the frame level in order to locate the spoofed segments. To address this, we propose a fine-grained partially spoofed audio detection method, Temporal Deepfake Location (TDL), which effectively captures both feature and location information. Specifically, our approach comprises two novel parts: an embedding similarity module and a temporal convolution operation. To sharpen the distinction between real and fake features, the embedding similarity module generates an embedding space that separates real frames from fake frames. To focus on position information, the temporal convolution operation computes frame-specific similarities among neighboring frames and dynamically selects informative neighbors for convolution. Extensive experiments show that our method outperforms baseline models on the ASVspoof2019 Partial Spoof dataset and demonstrates superior performance even in the cross-dataset scenario. The code is released online.
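The idea of selecting informative neighbors by frame-level similarity before convolving can be illustrated with a small sketch. This is not the released TDL implementation: the abstract does not give the exact selection rule, so the sketch assumes cosine similarity between frame embeddings and a top-k choice within a fixed window. The names `cosine`, `similarity_guided_temporal_conv`, `kernel`, and `window` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_guided_temporal_conv(frames, kernel, window=2):
    """Illustrative similarity-guided temporal convolution (assumed design).

    frames: list of T frame embeddings, each a list of D floats.
    kernel: list of k scalar weights; kernel[0] is applied to the most
            similar neighbor, kernel[1] to the next, and so on.
    window: neighbors are searched within t - window .. t + window.
    """
    out = []
    for t, center in enumerate(frames):
        lo, hi = max(0, t - window), min(len(frames), t + window + 1)
        # Rank candidate frames by descending similarity to the center frame.
        # sorted() is stable, so ties keep their temporal order.
        ranked = sorted(range(lo, hi), key=lambda i: -cosine(center, frames[i]))
        chosen = ranked[:len(kernel)]
        # Weighted sum over the selected neighbors only.
        mixed = [0.0] * len(center)
        for w, i in zip(kernel, chosen):
            for d, value in enumerate(frames[i]):
                mixed[d] += w * value
        out.append(mixed)
    return out


# Toy sequence: three "real" frames followed by three "fake" frames.
frames = [[1.0, 0.0]] * 3 + [[0.0, 1.0]] * 3
smoothed = similarity_guided_temporal_conv(frames, kernel=[0.5, 0.5], window=2)
print(smoothed == frames)  # True
```

In this toy case each frame aggregates only neighbors from its own class, so the boundary between frames 2 and 3 stays sharp. A plain moving average over the same window would mix real and fake embeddings near the boundary.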