In computer vision, a larger effective receptive field (ERF) is associated with better performance. While attention natively supports global context, its quadratic complexity limits its applicability to tasks that benefit from high-resolution input. In this work, we extend Hyena, a convolution-based attention replacement, from causal sequences to bidirectional data and two-dimensional image space. We scale Hyena's convolution kernels beyond the feature map size, up to 191$\times$191, to maximize ERF while maintaining sub-quadratic complexity in the number of pixels. We integrate our two-dimensional Hyena, HyenaPixel, and bidirectional Hyena into the MetaFormer framework. For image categorization, HyenaPixel and bidirectional Hyena achieve a competitive ImageNet-1k top-1 accuracy of 84.9% and 85.2%, respectively, with no additional training data, while outperforming other convolutional and large-kernel networks. Combining HyenaPixel with attention further improves accuracy. We attribute the success of bidirectional Hyena to learning the data-dependent geometric arrangement of pixels without a fixed neighborhood definition. Experimental results on downstream tasks suggest that HyenaPixel with large filters and a fixed neighborhood leads to better localization performance.
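One way to see how a kernel larger than the feature map can stay sub-quadratic in the number of pixels is to evaluate the convolution in the frequency domain. The following is a minimal NumPy sketch under stated assumptions, not the authors' implementation: it assumes the long convolution is computed with an FFT (as in the original Hyena operator), uses zero padding to avoid circular wrap-around, and omits Hyena's gating and implicit filter parameterization. The function names `fft_conv2d` and `direct_conv2d` and the toy sizes are illustrative only.

```python
import numpy as np


def fft_conv2d(x, k):
    """'Same'-size 2D linear convolution of x with k, evaluated via FFT.

    Cost is O(N log N) in the padded size, even if k is larger than x.
    Assumes odd kernel sizes so that the kernel has a well-defined centre.
    """
    H, W = x.shape
    Kh, Kw = k.shape
    # Pad both operands to the full linear-convolution size so the
    # circular convolution implied by the FFT does not wrap around.
    Fh, Fw = H + Kh - 1, W + Kw - 1
    X = np.fft.rfft2(x, s=(Fh, Fw))
    K = np.fft.rfft2(k, s=(Fh, Fw))
    full = np.fft.irfft2(X * K, s=(Fh, Fw))
    # Crop the region aligned with the kernel centre.
    top, left = (Kh - 1) // 2, (Kw - 1) // 2
    return full[top:top + H, left:left + W]


def direct_conv2d(x, k):
    """Reference spatial-domain convolution: O(N * Kh * Kw)."""
    H, W = x.shape
    Kh, Kw = k.shape
    ph, pw = (Kh - 1) // 2, (Kw - 1) // 2
    xp = np.pad(x, ((Kh - 1 - ph, ph), (Kw - 1 - pw, pw)))
    k_flipped = k[::-1, ::-1]  # true convolution flips the kernel
    out = np.empty((H, W), dtype=float)
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(xp[i:i + Kh, j:j + Kw] * k_flipped)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((6, 6))    # toy feature map
    k = rng.standard_normal((11, 11))  # kernel larger than the feature map
    print(np.allclose(fft_conv2d(x, k), direct_conv2d(x, k)))
```

With a kernel of roughly twice the feature-map size, every output position can see every input position, which is the global-context property the abstract attributes to large filters. The direct version is included only as a correctness reference for small inputs.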