Recent studies show that self-attention behaves like a low-pass filter (in contrast to convolution) and that enhancing its high-pass filtering capability improves model performance. Contrary to this idea, we examine existing convolution-based models with spectral analysis and observe that improving the low-pass filtering of convolution operations also improves performance. To account for this observation, we hypothesize that an optimal token mixer that captures balanced representations of both high- and low-frequency components can enhance model performance. We verify this by decomposing visual features in the frequency domain and recombining them in a balanced manner. To do so, we recast the balancing problem as a mask filtering problem in the frequency domain. We then introduce a novel token mixer named SPAM and use it to derive a MetaFormer model termed SPANet. Experimental results show that the proposed method provides a way to achieve this balance, and that balanced representations of both high- and low-frequency components improve model performance on multiple computer vision tasks. Our code is available at https://doranlyong.github.io/projects/spanet/.
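The idea of recasting frequency balancing as mask filtering can be illustrated with a minimal NumPy sketch. This is not the SPAM token mixer itself: the hard radial mask, the function names `radial_low_pass_mask` and `balance_frequencies`, and the parameters `cutoff` and `low_weight` are all assumptions made for illustration, and the abstract does not specify how the actual masks are defined or learned.

```python
import numpy as np


def radial_low_pass_mask(height, width, cutoff):
    """Binary mask that keeps frequencies within `cutoff` (cycles/sample) of DC.

    Hypothetical helper: a hard radial mask is an assumption, not the paper's design.
    """
    fy = np.fft.fftfreq(height)[:, None]  # vertical frequencies
    fx = np.fft.fftfreq(width)[None, :]   # horizontal frequencies
    radius = np.sqrt(fx ** 2 + fy ** 2)
    return (radius <= cutoff).astype(np.float64)


def balance_frequencies(feature, cutoff=0.1, low_weight=0.5):
    """Split a 2D feature map into low/high bands and recombine them with weights.

    Returns the weighted mix together with the two bands.
    """
    spectrum = np.fft.fft2(feature)
    mask = radial_low_pass_mask(feature.shape[0], feature.shape[1], cutoff)
    # Masking in the frequency domain, then transforming back, isolates each band.
    low = np.fft.ifft2(spectrum * mask).real
    high = np.fft.ifft2(spectrum * (1.0 - mask)).real
    mixed = low_weight * low + (1.0 - low_weight) * high
    return mixed, low, high


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feature = rng.standard_normal((8, 8))
    mixed, low, high = balance_frequencies(feature, cutoff=0.2, low_weight=0.7)
    # The two bands are complementary, so they sum back to the input.
    print(np.allclose(low + high, feature))
```

Because the two masks are complementary, the bands always sum back to the input, and `low_weight` alone controls how strongly the output leans toward low or high frequencies. A learned or soft mask could replace the hard cutoff without changing this structure.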