Partial audio forgery, a recently emerging form of manipulation, poses new challenges for audio forensics: countermeasures must detect subtle manipulations within long-duration audio. However, existing countermeasures still treat the task as classification and do not meaningfully analyze the start and end timestamps of partially forged segments. To address this challenge, we introduce a coarse-to-fine proposal refinement framework (CFPRF) that combines a frame-level detection network (FDN) and a proposal refinement network (PRN) for audio temporal forgery detection and localization. Specifically, the FDN mines informative inconsistency cues between real and fake frames to obtain discriminative features that roughly indicate forgery regions. The PRN predicts confidence scores and regression offsets to refine the coarse-grained proposals derived from the FDN. To learn robust discriminative features, we devise a difference-aware feature learning (DAFL) module, guided by contrastive representation learning, that amplifies the subtle differences between frames induced by minor manipulations. We further design a boundary-aware feature enhancement (BAFE) module that captures the contextual information of multiple transition boundaries and guides the interaction between boundary information and temporal features via a cross-attention mechanism. Extensive experiments show that CFPRF achieves state-of-the-art performance on several datasets, including LAV-DF, ASVS2019PS, and HAD.
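To make the coarse-to-fine idea concrete, the following is a minimal sketch of the two stages' post-processing, not the authors' implementation. It assumes that the FDN yields one fake-probability per frame and that the PRN yields one confidence score and one (start, end) offset pair per proposal. The function names, the 0.5 thresholds, the 20 ms frame duration, and all numeric inputs are hypothetical values chosen for illustration.

```python
import numpy as np


def frames_to_proposals(fake_prob, threshold=0.5, frame_sec=0.02):
    """Coarse stage: group consecutive frames whose fake probability exceeds
    `threshold` into (start, end) proposals, expressed in seconds.

    `threshold` and `frame_sec` are illustrative assumptions.
    """
    mask = np.asarray(fake_prob) > threshold
    # Pad with False so every run of fake frames has a rising and a falling edge.
    padded = np.concatenate(([False], mask, [False]))
    edges = np.diff(padded.astype(int))
    starts = np.flatnonzero(edges == 1)   # first frame of each run
    ends = np.flatnonzero(edges == -1)    # one past the last frame of each run
    return [(float(s * frame_sec), float(e * frame_sec)) for s, e in zip(starts, ends)]


def refine_proposals(proposals, offsets, confidences, min_conf=0.5):
    """Fine stage: shift each proposal boundary by its regression offset and
    drop proposals whose confidence falls below `min_conf`.

    In a trained system the offsets and confidences would be network outputs;
    here they are plain inputs.
    """
    refined = []
    for (start, end), (d_start, d_end), conf in zip(proposals, offsets, confidences):
        if conf < min_conf:
            continue
        new_start = max(0.0, start + d_start)
        new_end = max(new_start, end + d_end)  # keep the segment non-negative in length
        refined.append((new_start, new_end, conf))
    return refined


# Hypothetical frame-level scores: one three-frame run and one isolated frame.
fake_prob = [0.1, 0.2, 0.9, 0.8, 0.7, 0.1, 0.1, 0.6, 0.1, 0.1]
coarse = frames_to_proposals(fake_prob)

# Hypothetical refinement outputs, one entry per coarse proposal.
offsets = [(-0.01, 0.005), (0.0, 0.0)]
confidences = [0.9, 0.3]
refined = refine_proposals(coarse, offsets, confidences)
print(coarse)
print(refined)
```

In this sketch the isolated high-scoring frame becomes a coarse proposal but is discarded in the second stage because of its low confidence, which illustrates why a refinement step on top of frame-level scores can help localization.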