Recent studies have shown that 2D convolution and self-attention exhibit distinct spectral behaviors, and that optimizing their spectral properties can enhance vision model performance. However, existing theoretical analyses do not fully explain why 2D convolution is more effective at high-pass filtering than self-attention, or why larger kernels favor shape bias in a manner akin to self-attention. In this paper, we employ graph spectral analysis to theoretically simulate and compare the frequency responses of 2D convolution and self-attention within a unified framework. Our results corroborate previous empirical findings and reveal that node connectivity, modulated by window size, is a key factor in shaping spectral functions. Leveraging this insight, we introduce a \textit{spectral-adaptive modulation} (SPAM) mixer, which processes visual features in a spectral-adaptive manner, using multi-scale convolutional kernels together with a spectral re-scaling mechanism that refines spectral components. Based on SPAM, we develop SPANetV2 as a novel vision backbone. Extensive experiments demonstrate that SPANetV2 outperforms state-of-the-art models across multiple vision tasks, including ImageNet-1K classification, COCO object detection, and ADE20K semantic segmentation.
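As a rough illustration of how window size can shape a spectral function, the following toy sketch computes the frequency response of uniform neighborhood averaging on a ring graph. It is not the analysis carried out in the paper: the 1D ring (instead of a 2D grid), the uniform edge weights, and the function name `window_filter_response` are all assumptions made for this example. Because the averaging operator on a ring is circulant, its eigenvalues over the graph Fourier modes are given by the DFT of its kernel.

```python
import numpy as np


def window_filter_response(num_nodes: int, radius: int) -> np.ndarray:
    """Frequency response of uniform averaging over a window on a ring graph.

    Hypothetical helper for illustration only. Each node is connected to
    itself and to every node within `radius` hops; all edges share the same
    weight (an assumption, not a learned kernel or attention map).
    """
    kernel = np.zeros(num_nodes)
    offsets = np.arange(-radius, radius + 1)
    kernel[offsets % num_nodes] = 1.0   # mark the connected neighbors
    kernel /= kernel.sum()              # row-normalize, as in averaging
    # The kernel is symmetric, so its DFT is real up to rounding error.
    return np.fft.fft(kernel).real


local = window_filter_response(64, 1)    # small window, 3 neighbors
full = window_filter_response(64, 32)    # window covering every node

print("local window, highest frequency:", local[32])
print("full window, largest non-DC magnitude:", np.abs(full[1:]).max())
```

Under these assumptions, the small window still passes a nonzero share of the highest-frequency mode, whereas the fully connected window keeps only the DC component. This is one simple way to see connectivity acting as a control on how low-pass a mixer is.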