In vision tasks, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, convolution needs multiple stacked layers and a hierarchical structure to cover a large context. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to the non-causal two-dimensional image space. We scale the Hyena convolution kernels beyond the feature map size, up to 191$\times$191, to maximize the ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena (HyenaPixel) and bidirectional Hyena into the MetaFormer framework. For image classification, HyenaPixel and bidirectional Hyena achieve competitive ImageNet-1k top-1 accuracies of 83.0% and 83.5%, respectively, while outperforming other large-kernel networks. Combining HyenaPixel with attention further increases accuracy to 83.6%. We attribute the success of attention to the lack of spatial bias in later stages and support this finding with bidirectional Hyena.
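To illustrate how a kernel larger than the feature map can remain sub-quadratic in the number of pixels, the following is a minimal NumPy sketch of a non-causal two-dimensional convolution evaluated with the FFT. It is not the HyenaPixel implementation: the function names `fft_conv2d_same` and `direct_conv2d_same`, the single-channel setting, the zero padding, and the choice of a $(2H-1)\times(2W-1)$ kernel are assumptions made for illustration, and the gating and learned implicit filters of Hyena are omitted.

```python
import numpy as np


def fft_conv2d_same(x, k):
    """Non-causal 2D convolution via FFT, cropped to the input size.

    x: (H, W) feature map (single channel, an assumption of this sketch).
    k: (Kh, Kw) kernel with odd side lengths; it may be larger than x.
    """
    H, W = x.shape
    Kh, Kw = k.shape
    # Zero-pad to the full linear-convolution size so the circular
    # convolution computed by the FFT does not wrap around.
    fh, fw = H + Kh - 1, W + Kw - 1
    X = np.fft.rfft2(x, s=(fh, fw))
    K = np.fft.rfft2(k, s=(fh, fw))
    full = np.fft.irfft2(X * K, s=(fh, fw))
    # Crop so the kernel centre (Kh // 2, Kw // 2) aligns with each output pixel.
    top, left = Kh // 2, Kw // 2
    return full[top:top + H, left:left + W]


def direct_conv2d_same(x, k):
    """Reference implementation with explicit loops, used only for checking."""
    H, W = x.shape
    Kh, Kw = k.shape
    ch, cw = Kh // 2, Kw // 2
    y = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for m in range(H):
                for n in range(W):
                    a, b = i + ch - m, j + cw - n
                    if 0 <= a < Kh and 0 <= b < Kw:
                        y[i, j] += x[m, n] * k[a, b]
    return y


# Toy sizes: a 6x6 map with an 11x11 kernel, i.e. (2H-1) x (2W-1),
# so every output pixel can see every input pixel.
rng = np.random.default_rng(0)
x_demo = rng.standard_normal((6, 6))
k_demo = rng.standard_normal((11, 11))
y_demo = fft_conv2d_same(x_demo, k_demo)
```

With $N = H \cdot W$ pixels and a kernel of comparable size, the direct loop costs on the order of $N^2$ multiply-adds, whereas the FFT route costs on the order of $N \log N$. Under the assumptions above, this is the sense in which a global, non-causal kernel can stay sub-quadratic.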