Heracles: A Hybrid SSM-Transformer Model for High-Resolution Image and Time-Series Analysis

Transformers have revolutionized image modeling tasks with adaptations like DeIT, Swin, SVT, Biformer, STVit, and FDVIT. However, these models often face challenges with inductive bias and with quadratic complexity, which makes them less efficient for high-resolution images. State space models (SSMs) such as Mamba, V-Mamba, ViM, and SiMBA offer an alternative for handling high-resolution images in computer vision tasks. These SSMs encounter two major issues. First, they become unstable when scaled to large network sizes. Second, although they efficiently capture global information in images, they inherently struggle with local information. To address these challenges, we introduce Heracles, a novel SSM that integrates a local SSM, a global SSM, and an attention-based token interaction module. Heracles uses a Hartley kernel-based state space model for global image information, a localized convolutional network for local details, and attention mechanisms in deeper layers for token interactions. Our extensive experiments demonstrate that Heracles-C-small achieves state-of-the-art performance on the ImageNet dataset with 84.5% top-1 accuracy. Heracles-C-Large and Heracles-C-Huge further improve accuracy to 85.9% and 86.4%, respectively. Additionally, Heracles excels in transfer learning tasks on datasets such as CIFAR-10, CIFAR-100, Oxford Flowers, and Stanford Cars, and in instance segmentation on the MSCOCO dataset. Heracles also proves its versatility by achieving state-of-the-art results on seven time-series datasets, showing that it generalizes across domains with spectral data while capturing both local and global information. The project page is available at https://github.com/badripatro/heracles
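To make the "Hartley kernel" idea concrete, the following is a minimal sketch of a discrete Hartley transform (DHT) used for global token mixing on a 1-D sequence. It is not the authors' implementation: the function names `dht`, `idht`, and `global_spectral_mix`, as well as the `gains` vector standing in for learned per-frequency weights, are assumptions introduced here for illustration, and the naive O(N^2) loop is chosen for readability rather than speed.

```python
import math


def dht(x):
    """Naive O(N^2) discrete Hartley transform of a real-valued sequence."""
    n = len(x)
    out = []
    for k in range(n):
        acc = 0.0
        for t in range(n):
            angle = 2.0 * math.pi * k * t / n
            # The Hartley kernel is cas(angle) = cos(angle) + sin(angle),
            # so the transform of a real input stays real.
            acc += x[t] * (math.cos(angle) + math.sin(angle))
        out.append(acc)
    return out


def idht(h):
    """Inverse DHT: the transform is its own inverse up to a 1/N factor."""
    n = len(h)
    return [v / n for v in dht(h)]


def global_spectral_mix(tokens, gains):
    """Scale each Hartley coefficient by a gain, then transform back.

    `gains` is a hypothetical stand-in for learned per-frequency weights.
    Every output position depends on every input position, which is the
    sense in which the mixing is global.
    """
    coeffs = dht(tokens)
    return idht([c * g for c, g in zip(coeffs, gains)])


if __name__ == "__main__":
    signal = [1.0, 2.0, 0.5, -1.0]
    # Keeping only the zero-frequency coefficient replaces every entry
    # with the mean of the input.
    print(global_spectral_mix(signal, [1.0, 0.0, 0.0, 0.0]))
```

Because the Hartley transform maps real inputs to real outputs, a layer built this way avoids complex-valued arithmetic. A practical version would typically compute the transform from an FFT and operate on 2-D feature maps.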

