Attention mechanisms have become a key module in modern vision backbones due to their ability to model long-range dependencies. However, their quadratic complexity in sequence length and the difficulty of interpreting attention weights limit both scalability and interpretability. Recent attention-free architectures demonstrate that strong performance can be achieved without pairwise attention, motivating the search for alternatives. In this work, we introduce Vision KAN (ViK), an attention-free backbone inspired by Kolmogorov-Arnold Networks (KANs). At its core lies MultiPatch-RBFKAN, a unified token mixer that combines (a) patch-wise nonlinear transforms with Radial Basis Function (RBF)-based KANs, (b) axis-wise separable mixing for efficient local propagation, and (c) low-rank global mapping for long-range interaction. Employed as a drop-in replacement for attention modules, this formulation tackles the prohibitive cost of full KANs on high-resolution features by adopting a patch-wise grouping strategy, with lightweight operators restoring cross-patch dependencies. Experiments on ImageNet-1K show that ViK achieves competitive accuracy with linear complexity, demonstrating the potential of KAN-based token mixing as an efficient and theoretically grounded alternative to attention.
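To make the three-branch structure of the proposed token mixer concrete, the following is a minimal NumPy sketch of the idea, not the authors' implementation. The class name `RBFKANMixer`, the per-token RBF expansion standing in for the paper's patch-wise grouping, and all shapes and hyperparameters (`n_basis`, `rank`) are illustrative assumptions; only the division into (a) RBF-KAN nonlinear transform, (b) axis-wise separable mixing, and (c) low-rank global mapping comes from the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def rbf_basis(x, centers, gamma=2.0):
    # Expand each scalar feature into K Gaussian RBF activations.
    # x: (..., ) -> (..., K) via broadcasting against centers (K,).
    return np.exp(-gamma * (x[..., None] - centers) ** 2)

class RBFKANMixer:
    """Illustrative sketch of a three-branch token mixer (all names assumed):
    (a) nonlinear transform with RBF-based KAN edge functions,
    (b) axis-wise separable mixing along rows then columns,
    (c) low-rank global mapping across all N = h*w tokens."""
    def __init__(self, h, w, c, n_basis=8, rank=4):
        self.centers = np.linspace(-2.0, 2.0, n_basis)
        # (a) learned combination of RBF features, per channel
        self.w_kan = rng.normal(0, 0.1, (c, n_basis))
        # (b) small per-axis mixing matrices (linear in h and w, not h*w)
        self.w_h = rng.normal(0, 0.1, (h, h))
        self.w_w = rng.normal(0, 0.1, (w, w))
        # (c) rank-r factorization of an N x N token map: cost O(N*r), not O(N^2)
        n = h * w
        self.u = rng.normal(0, 0.1, (n, rank))
        self.v = rng.normal(0, 0.1, (rank, n))

    def __call__(self, x):
        h, w, c = x.shape
        # (a) KAN-style nonlinearity: RBF features combined per channel
        phi = rbf_basis(x, self.centers)               # (h, w, c, K)
        local = np.einsum('hwck,ck->hwc', phi, self.w_kan)
        # (b) axis-wise separable mixing: rows, then columns
        axis = np.einsum('ij,jwc->iwc', self.w_h, local)
        axis = np.einsum('ij,hjc->hic', self.w_w, axis)
        # (c) low-rank global interaction over flattened tokens
        tokens = axis.reshape(h * w, c)
        glob = self.u @ (self.v @ tokens)              # (N, c) via rank-r bottleneck
        return axis + glob.reshape(h, w, c)            # residual combination

mixer = RBFKANMixer(h=8, w=8, c=16)
x = rng.normal(size=(8, 8, 16))
y = mixer(x)
print(y.shape)  # (8, 8, 16)
```

Note how every branch scales linearly in the token count N: the RBF transform is pointwise, the axis-wise matrices act on h and w separately, and the global map is factored through a rank-r bottleneck, which is how the sketch mirrors the abstract's linear-complexity claim.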