Transformer networks are widely used in face forgery detection thanks to their scalability across large datasets. Despite this success, transformers struggle to balance the capture of global context, which is crucial for unveiling forgery clues, against computational complexity. To mitigate this issue, we introduce Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to process extensive visual contexts efficiently while avoiding catastrophic forgetting. Our approach enables the target token to perceive global information by assigning differential attention levels to tokens at varying distances. We apply self-attention along both spatial axes, preserving spatial priors while easing the computational burden. Moreover, we present an adaptive frequency Band-Attention Modulation mechanism that treats the entire Discrete Cosine Transform spectrogram as a series of frequency bands with learnable weights. Together, these components let BAR-Net achieve favorable performance on several face forgery datasets, outperforming current state-of-the-art methods.
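The distance-dependent attention described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's exact formulation: the bidirectional decay `gamma ** |i - j|`, the single-head layout, and the default `gamma` value are all assumptions made for clarity.

```python
import numpy as np

def decay_mask(n, gamma=0.9):
    """Bidirectional distance-decay mask (illustrative): token i attends to
    token j with weight gamma**|i - j|, so distant tokens are progressively
    down-weighted while still contributing global context."""
    idx = np.arange(n)
    return gamma ** np.abs(idx[:, None] - idx[None, :])

def axis_attention(x, Wq, Wk, Wv, gamma=0.9):
    """Attention along one spatial axis, modulated by the decay mask.
    x: (n, c) tokens taken from a single row or column of the feature map;
    applying this along both axes keeps the cost linear in image side length
    rather than quadratic in the number of tokens."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / np.sqrt(k.shape[1])
    return (scores * decay_mask(x.shape[0], gamma)) @ v
```

Running `axis_attention` first over each row and then over each column of a feature map is one plausible way to realize "self-attention along both spatial axes" while keeping spatial priors intact.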
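The Band-Attention Modulation idea, i.e. treating the DCT spectrogram as frequency bands with learnable weights, could be sketched as below. The anti-diagonal band partition and the band count are assumptions for illustration; the paper's actual band layout and learning rule may differ.

```python
import numpy as np

def band_masks(h, w, n_bands):
    """Partition an h x w DCT spectrogram into n_bands anti-diagonal
    frequency bands (low to high), returned as boolean masks. The
    anti-diagonal split is an assumed layout: coefficient (u, v) is
    binned by its summed frequency index u + v."""
    i = np.arange(h)[:, None] + np.arange(w)[None, :]
    edges = np.linspace(0, i.max() + 1, n_bands + 1)
    return [(i >= edges[b]) & (i < edges[b + 1]) for b in range(n_bands)]

def modulate(spec, weights):
    """Scale each frequency band of the DCT spectrogram by its own
    (learnable) weight, emphasizing bands that carry forgery clues."""
    out = np.zeros_like(spec)
    for mask, weight in zip(band_masks(*spec.shape, len(weights)), weights):
        out[mask] = spec[mask] * weight
    return out
```

In training, `weights` would be learnable parameters updated by backpropagation, so the network itself decides which frequency bands to amplify or suppress.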