Recent vision and multimodal foundation backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress, enabling unified modeling across images, text, and beyond. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. In this work, we propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired foundation backbone that integrates hierarchical memory mechanisms with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, making decisions more interpretable, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances interpretability and data efficiency beyond existing self-attention- or state-space-based approaches. We conducted extensive experiments on public computer vision benchmarks, and V-HMN achieved competitive results against widely adopted backbone architectures, while offering better interpretability, higher data efficiency, and stronger biological plausibility. These findings highlight the potential of V-HMN to serve as a next-generation vision foundation model, while also providing a generalizable blueprint for multimodal backbones in domains such as text and audio, thereby bridging brain-inspired computation with large-scale machine learning.
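To make the three ingredients named above concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract gives no equations, so everything here is an assumption made for illustration: the softmax-based retrieval rule (borrowed from modern Hopfield networks), the mean-pooled context used as the global query, the equal-weight mix of local and global predictions, and all names and parameters (`hopfield_retrieve`, `refine`, `beta`, `step_size`, `steps`).

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def hopfield_retrieve(queries, memory, beta=1.0):
    """Assumed retrieval rule: softmax-weighted mix of stored patterns.

    queries: (n, d) array, memory: (m, d) array of stored patterns.
    Returns the retrieved patterns (n, d) and the weights (n, m).
    """
    weights = softmax(beta * (queries @ memory.T), axis=-1)
    return weights @ memory, weights


def refine(tokens, local_memory, global_memory, steps=3, step_size=0.5, beta=4.0):
    """Hypothetical predictive-coding-style refinement of patch tokens.

    Each step retrieves a per-patch (local) prediction and one shared
    (global) prediction from the mean-pooled context, then moves every
    token a fraction `step_size` toward the averaged prediction.
    """
    x = tokens.astype(float)  # astype returns a copy, so the input is untouched
    for _ in range(steps):
        local_pred, _ = hopfield_retrieve(x, local_memory, beta)
        context = x.mean(axis=0, keepdims=True)           # (1, d) global query
        global_pred, _ = hopfield_retrieve(context, global_memory, beta)
        prediction = 0.5 * (local_pred + global_pred)     # assumed equal mix
        error = prediction - x                            # prediction error
        x = x + step_size * error                         # iterative correction
    return x


# Usage: two noisy patch tokens refined against three stored patterns.
stored = np.eye(3)
patches = np.array([[0.9, 0.1, 0.0], [0.8, 0.0, 0.2]])
refined = refine(patches, stored, stored, steps=3, step_size=0.5, beta=50.0)
_, match_weights = hopfield_retrieve(patches, stored, beta=50.0)
```

In this sketch, `match_weights` records how strongly each patch matches each stored pattern, which is one way to read the abstract's claim that memory retrieval exposes the relationship between inputs and stored patterns.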