Localizing partial deepfake audio, where only some segments of speech are manipulated, remains challenging because the modifications are subtle and scattered. Existing approaches typically rely on frame-level predictions to identify spoofed segments, and some recent methods improve performance by concentrating on the transitions between real and fake audio. However, we observe that these models tend to over-rely on boundary artifacts while neglecting the manipulated content that follows. We argue that effective localization requires understanding entire segments, not just detecting the transitions between them. We therefore propose Segment-Aware Learning (SAL), a framework that encourages models to focus on the internal structure of segments. SAL introduces two core techniques: Segment Positional Labeling, which provides fine-grained frame-level supervision based on each frame's relative position within a segment, and Cross-Segment Mixing, a data augmentation method that generates diverse segment patterns. Experiments across multiple deepfake localization datasets show that SAL consistently achieves strong performance in both in-domain and out-of-domain settings, with notable gains in non-boundary regions and reduced reliance on transition artifacts. The code is available at https://github.com/SentryMao/SAL.
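To make the idea of position-based frame supervision concrete, the following is a minimal sketch of how frame labels might be enriched with a relative position inside each segment. It is an illustration under stated assumptions, not the labeling scheme used by SAL: the function names `segment_relative_positions` and `positional_classes`, the three zones (`start`, `middle`, `end`), and the `edge_fraction` threshold are all hypothetical and do not come from the paper or its repository.

```python
from itertools import groupby
from typing import List, Tuple


def segment_relative_positions(frame_labels: List[int]) -> List[Tuple[int, float]]:
    """Pair each frame's binary label with its relative position in its segment.

    Assumption: a segment is a maximal run of identical frame labels
    (e.g. 0 = real, 1 = fake). The position runs from 0.0 at the first
    frame of a segment to 1.0 at the last; a one-frame segment gets 0.0.
    """
    out: List[Tuple[int, float]] = []
    for label, run in groupby(frame_labels):
        length = len(list(run))
        for i in range(length):
            pos = i / (length - 1) if length > 1 else 0.0
            out.append((label, pos))
    return out


def positional_classes(
    frame_labels: List[int], edge_fraction: float = 0.25
) -> List[Tuple[int, str]]:
    """Bucket each frame into a hypothetical zone within its segment.

    Frames whose relative position is below `edge_fraction` are tagged
    'start', frames above 1 - `edge_fraction` are tagged 'end', and the
    rest are tagged 'middle'. The zones and threshold are illustrative.
    """
    classes: List[Tuple[int, str]] = []
    for label, pos in segment_relative_positions(frame_labels):
        if pos < edge_fraction:
            zone = "start"
        elif pos > 1.0 - edge_fraction:
            zone = "end"
        else:
            zone = "middle"
        classes.append((label, zone))
    return classes
```

In a sketch like this, a model trained on the `(label, zone)` targets receives a distinct supervision signal for frames in the interior of a manipulated segment, rather than only for frames near a real/fake transition.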