The quadratic compute and memory cost of global self-attention severely limits its use on high-resolution images. Local attention reduces this complexity by restricting attention to spatial neighborhoods. Block-sparse kernels can further improve the efficiency of local attention, but conventional local attention patterns often fail to deliver significant speedups because the tokens in a 2D window are not contiguous in the flattened 1D sequence. This work proposes a method for constructing windows and neighborhoods based on the Hilbert curve: image tokens are first reordered along a Hilbert curve, and windows and neighborhoods are then formed on the reordered 1D sequence. From a block-sparse perspective, this strategy substantially increases block sparsity and can be combined with existing block-sparse kernels to improve the efficiency of 2D local attention. Experiments show that the proposed Hilbert Window Attention and Hilbert Slide Attention accelerate window attention and slide attention by roughly $4\times$ and $18\times$, respectively. To assess practicality, the strategy is instantiated as the Hilbert Window Transformer and the Hilbert Neighborhood Transformer, both of which achieve end-to-end speedups with minimal accuracy loss. Overall, combining Hilbert-guided local attention with block-sparse kernels offers a general and practical approach to improving the efficiency of 2D local attention on images.
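The core idea can be illustrated with a small sketch. The code below is not the paper's implementation; it is a minimal illustration, assuming a row-major 8×8 token grid and 16-token windows (both hypothetical choices), of how sorting tokens by their index along a Hilbert curve makes each contiguous 1D window correspond to a spatially compact 2D block. It uses the standard iterative coordinate-to-Hilbert-index conversion.

```python
def rot(n, x, y, rx, ry):
    """Rotate/reflect a quadrant so the Hilbert pattern recurses correctly."""
    if ry == 0:
        if rx == 1:
            x, y = n - 1 - x, n - 1 - y
        x, y = y, x
    return x, y


def xy2d(n, x, y):
    """Map grid coordinate (x, y) to its index along an n x n Hilbert curve
    (n must be a power of two). Standard iterative conversion."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        x, y = rot(n, x, y, rx, ry)
        s //= 2
    return d


def hilbert_permutation(n):
    """Permutation placing row-major token indices in Hilbert order.

    Token i of the flattened sequence sits at grid position (i % n, i // n).
    """
    return sorted(range(n * n), key=lambda i: xy2d(n, i % n, i // n))


if __name__ == "__main__":
    n, window = 8, 16  # illustrative: 8x8 token grid, 16-token windows
    perm = hilbert_permutation(n)
    # Each contiguous 16-token window of the reordered sequence covers a
    # compact spatial region (here: exactly one 4x4 quadrant of the grid),
    # so a block-sparse kernel can treat it as a dense contiguous block.
    for w in range(0, n * n, window):
        xs = [perm[i] % n for i in range(w, w + window)]
        ys = [perm[i] // n for i in range(w, w + window)]
        print(f"window {w // window}: x span {min(xs)}-{max(xs)}, "
              f"y span {min(ys)}-{max(ys)}")
```

Under row-major flattening, the same 16 tokens would be scattered across four rows of the sequence; after Hilbert reordering they form one contiguous span, which is what lets block-sparse attention kernels exploit the locality.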