Transformer networks are widely used in face forgery detection because they scale well to large datasets. Despite this success, transformers struggle to balance computational complexity against the capture of global context, which is crucial for revealing forgery clues. To mitigate this issue, we introduce the Band-Attention modulated RetNet (BAR-Net), a lightweight network designed to process extensive visual contexts efficiently while avoiding catastrophic forgetting. Our approach lets the target token perceive global information by assigning different attention levels to tokens at different distances. We implement self-attention along both spatial axes, which preserves spatial priors and eases the computational burden. Moreover, we present the adaptive frequency Band-Attention Modulation mechanism, which treats the entire Discrete Cosine Transform spectrogram as a series of frequency bands with learnable weights. With these components, BAR-Net achieves favorable performance on several face forgery datasets, outperforming current state-of-the-art methods.
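To make the idea of weighting frequency bands concrete, the following is a minimal sketch rather than the BAR-Net implementation. It assumes a square single-channel block, an orthonormal 2D DCT-II, and bands defined by the diagonal index u + v of each coefficient; the abstract does not specify how bands are partitioned. The function names (`dct2`, `idct2`, `band_modulate`) are hypothetical, and the band weights are passed in as fixed numbers instead of being learned.

```python
import math


def dct_matrix(n):
    """Return the orthonormal DCT-II basis as an n x n list of rows."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    return [list(r) for r in zip(*m)]


def dct2(block):
    """2D DCT of a square block: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """Inverse 2D DCT: C^T * Y * C (C is orthonormal)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


def band_modulate(block, band_weights):
    """Scale each DCT coefficient by the weight of its frequency band.

    Assumption: coefficient (u, v) belongs to band
    (u + v) * num_bands // (2n - 1), so band 0 holds the lowest
    frequencies and the last band holds the highest.
    """
    n = len(block)
    num_bands = len(band_weights)
    coeffs = dct2(block)
    for u in range(n):
        for v in range(n):
            band = (u + v) * num_bands // (2 * n - 1)
            coeffs[u][v] *= band_weights[band]
    return idct2(coeffs)


if __name__ == "__main__":
    block = [[float(4 * r + c) for c in range(4)] for r in range(4)]
    # Suppress the lowest band and keep the two higher bands.
    for row in band_modulate(block, [0.0, 1.0, 1.0]):
        print([round(v, 3) for v in row])
```

In a trained model the entries of `band_weights` would be parameters updated by gradient descent, typically in a tensor framework instead of pure Python. Setting every weight to 1 leaves the block unchanged, which is a convenient sanity check for the transform pair.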