Partially spoofed audio detection is a challenging task because it requires judging the authenticity of audio accurately at the frame level. To address this issue, we propose a fine-grained partially spoofed audio detection method, namely Temporal Deepfake Location (TDL), which effectively captures both feature and location information. Specifically, our approach involves two novel parts: an embedding similarity module and a temporal convolution operation. To sharpen the distinction between real and fake features, the embedding similarity module is designed to generate an embedding space that separates real frames from fake frames. To concentrate effectively on position information, the temporal convolution operation calculates frame-specific similarities among neighboring frames and dynamically selects informative neighbors for convolution. Extensive experiments show that our method outperforms baseline models on the ASVspoof2019 Partial Spoof dataset and demonstrates superior performance even in the cross-dataset scenario.
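The idea of computing similarities among neighboring frames and keeping only the informative neighbors can be illustrated with a minimal sketch. This is not the TDL implementation: the function names `cosine` and `similarity_guided_smoothing`, the parameters `radius` and `top_k`, the use of cosine similarity, and the plain averaging that stands in for a learned convolution are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def similarity_guided_smoothing(frames, radius=2, top_k=2):
    """Hypothetical stand-in for a similarity-aware temporal operation.

    For each frame, score the neighbors within `radius` by cosine similarity,
    keep the `top_k` most similar ones, and average them with the frame itself.
    A learned convolution would apply trained weights instead of a plain mean.
    """
    out = []
    for t, x in enumerate(frames):
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        scored = [(cosine(x, frames[j]), j) for j in range(lo, hi) if j != t]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        group = [x] + [frames[j] for _, j in scored[:top_k]]
        out.append([sum(col) / len(group) for col in zip(*group)])
    return out


# Toy 2-D frame embeddings: three "real-like" frames followed by three "fake-like" ones.
frames = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.1], [0.0, 1.0], [0.1, 0.9], [0.0, 1.0]]
smoothed = similarity_guided_smoothing(frames, radius=2, top_k=1)
```

In this toy input, the frames on either side of the boundary (indices 2 and 3) each select a neighbor from their own segment, so the aggregation does not blend the two groups the way a fixed-window average would.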