Flattening is an essential operation in computer vision: it converts multi-dimensional feature maps or images into one-dimensional vectors. However, existing flattening approaches do not preserve local smoothness, which can weaken the representation learning capacity of vision models. In this paper, we propose Hilbert curve flattening as a method for preserving locality in flattened matrices. We compare it with the commonly used Zigzag operation and demonstrate that Hilbert curve flattening better retains the spatial relationships and local smoothness of the original grid structure, while remaining robust to variations in input scale. We further introduce the Localformer, a vision transformer architecture that combines Hilbert token sampling with a token aggregator to strengthen its locality bias. Extensive experiments on image classification and semantic segmentation show that the Localformer consistently outperforms baseline models. We also show that Hilbert curve flattening brings consistent performance gains to other popular architectures (e.g., MLP-Mixer).
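The locality argument can be illustrated with a small sketch. It is not the paper's implementation: it assumes a square grid whose side is a power of two, it assumes that "Zigzag" refers to the row-major raster order produced by a standard reshape, and the function names (`hilbert_order`, `raster_order`, `path_length`, `hilbert_flatten`) are hypothetical. The index-to-coordinate mapping is the standard iterative Hilbert curve construction. Path length, the summed Manhattan distance between consecutive cells in the flattened order, serves here as a simple proxy for locality.

```python
def hilbert_d2xy(n, d):
    """Map index d along a Hilbert curve to (x, y) on an n x n grid.

    Assumes n is a power of two and 0 <= d < n * n.
    """
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate (and possibly reflect) the current sub-square.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(n):
    """Visit order of grid cells along the Hilbert curve."""
    return [hilbert_d2xy(n, d) for d in range(n * n)]


def raster_order(n):
    """Row-major visit order (assumed here to be the 'Zigzag' baseline)."""
    return [(x, y) for y in range(n) for x in range(n)]


def path_length(order):
    """Sum of Manhattan distances between consecutive cells in the order."""
    return sum(abs(x1 - x0) + abs(y1 - y0)
               for (x0, y0), (x1, y1) in zip(order, order[1:]))


def hilbert_flatten(grid):
    """Flatten a square grid (list of rows) along the Hilbert curve."""
    n = len(grid)
    return [grid[y][x] for x, y in hilbert_order(n)]


if __name__ == "__main__":
    n = 4
    print("Hilbert path length:", path_length(hilbert_order(n)))
    print("Raster path length: ", path_length(raster_order(n)))
```

Under these assumptions, every step along the Hilbert order moves to an adjacent cell, so the path length on an n x n grid is n² − 1, whereas the raster order jumps back to the start of the next row after each row and has path length 2n(n − 1).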