Transformers have revolutionized image modeling tasks with adaptations like DeiT, Swin, SVT, BiFormer, STViT, and FDViT. However, these models often face challenges with inductive bias and with the quadratic complexity of attention, which makes them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks, but they have two major issues. First, they become unstable when scaled to large network sizes. Second, although they capture global information in images efficiently, they inherently struggle with local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles uses a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-small achieves state-of-the-art performance on the ImageNet dataset with 84.5\% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9\% and 86.4\%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showing that it generalizes across domains with spectral data while capturing both local and global information. The project page is available at \url{https://github.com/badripatro/heracles}.
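To make the Hartley-kernel idea concrete, the following is a minimal sketch of global token mixing in the Hartley domain. It is not the Heracles implementation: the function names `dht`, `idht`, and `global_hartley_mix`, the `(num_tokens, channels)` layout, and the elementwise `kernel` are all assumptions made for illustration. The sketch relies only on the standard identity that the discrete Hartley transform of a real signal equals the real part minus the imaginary part of its FFT, and that the transform is its own inverse up to a factor of 1/N.

```python
import numpy as np


def dht(x, axis=-1):
    """Discrete Hartley transform along one axis, computed from the FFT.

    For real input, H[k] = Re(F[k]) - Im(F[k]), where F is the DFT of x.
    """
    spectrum = np.fft.fft(x, axis=axis)
    return spectrum.real - spectrum.imag


def idht(h, axis=-1):
    """Inverse DHT: the transform is an involution up to a factor of 1/N."""
    return dht(h, axis=axis) / h.shape[axis]


def global_hartley_mix(tokens, kernel):
    """Hypothetical global mixing step (an assumption, not the paper's layer).

    tokens: real array of shape (num_tokens, channels)
    kernel: real array of the same shape, standing in for learned weights
    Every output token depends on every input token, because the gate is
    applied to the spectrum taken over the token axis.
    """
    spectrum = dht(tokens, axis=0)
    return idht(spectrum * kernel, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    tokens = rng.standard_normal((16, 8))   # 16 tokens, 8 channels
    kernel = rng.standard_normal((16, 8))   # placeholder for learned weights
    mixed = global_hartley_mix(tokens, kernel)
    print(mixed.shape, mixed.dtype)
```

Because the Hartley transform maps real inputs to real outputs, the kernel in this sketch stays real-valued, unlike a Fourier-domain filter, which needs complex weights. Note that an elementwise product in the Hartley domain is a spectral gate and matches an exact circular convolution only when the kernel is even-symmetric.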