The visual pathway of the human brain includes two sub-pathways, i.e., the ventral pathway and the dorsal pathway, which focus on object identification and dynamic information modeling, respectively. Both pathways comprise multi-layer structures, with each layer responsible for processing a different aspect of visual information. Inspired by the visual information processing mechanism of the human brain, we propose the Brain Inspired Masked Modeling (BIMM) framework, which aims to learn comprehensive representations from videos. Specifically, our approach consists of a ventral branch and a dorsal branch, which learn image and video representations, respectively. Both branches employ the Vision Transformer (ViT) as their backbone and are trained with a masked modeling method. To mirror the goals of the different visual cortices in the brain, we segment the encoder of each branch into three intermediate blocks and reconstruct progressive prediction targets with lightweight decoders. Furthermore, drawing inspiration from the information-sharing mechanism in the visual pathways, we propose a partial parameter sharing strategy between the branches during training. Extensive experiments demonstrate that BIMM outperforms state-of-the-art methods.
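The structural ideas above (random masking, an encoder split into three blocks, and partial parameter sharing between two branches) can be illustrated with a small, framework-free sketch. It is not the authors' implementation: the `Layer` placeholder, the encoder depth of 12, the mask ratio of 0.75, the near-equal block split, and the choice of which layer indices are shared are all assumptions made purely for illustration, since the abstract does not specify them.

```python
import random


class Layer:
    """Hypothetical stand-in for one Transformer encoder layer (holds only a name)."""

    def __init__(self, name):
        self.name = name


def random_mask(num_tokens, mask_ratio, seed=0):
    """Randomly split token indices into (visible, masked) lists."""
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def split_into_blocks(layers, num_blocks=3):
    """Split a list of layers into contiguous blocks of near-equal size."""
    size, remainder = divmod(len(layers), num_blocks)
    blocks, start = [], 0
    for b in range(num_blocks):
        end = start + size + (1 if b < remainder else 0)
        blocks.append(layers[start:end])
        start = end
    return blocks


def build_branches(depth=12, shared_layers=frozenset({8, 9, 10, 11})):
    """Build ventral and dorsal encoders that share some layer objects.

    Which layers are shared is an assumption here; sharing is modeled by
    placing the very same Layer object in both branches.
    """
    ventral, dorsal = [], []
    for i in range(depth):
        if i in shared_layers:
            layer = Layer(f"shared_{i}")
            ventral.append(layer)
            dorsal.append(layer)
        else:
            ventral.append(Layer(f"ventral_{i}"))
            dorsal.append(Layer(f"dorsal_{i}"))
    return ventral, dorsal


if __name__ == "__main__":
    visible, masked = random_mask(num_tokens=196, mask_ratio=0.75)
    ventral, dorsal = build_branches()
    for name, branch in (("ventral", ventral), ("dorsal", dorsal)):
        blocks = split_into_blocks(branch)
        print(name, [[layer.name for layer in block] for block in blocks])
    print(len(visible), "visible tokens,", len(masked), "masked tokens")
```

In a real model, each of the three blocks would feed its own lightweight decoder that reconstructs a block-specific prediction target, and the shared layers would be ordinary modules referenced by both branches so that their parameters receive gradients from both training objectives.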