A Dual-Stream Neural Network Explains the Functional Segregation of Dorsal and Ventral Visual Pathways in Human Brains

The human visual system uses two parallel pathways for spatial processing and object recognition. In contrast, computer vision systems tend to use a single feedforward pathway, which makes them less robust, adaptive, and efficient than human vision. To bridge this gap, we developed a dual-stream vision model inspired by the human eyes and brain. At the input level, the model samples two complementary visual patterns to mimic how the human eyes use magnocellular and parvocellular retinal ganglion cells to separate retinal inputs to the brain. At the backend, the model processes the separate input patterns through two convolutional neural network (CNN) branches to mimic how the human brain uses the dorsal and ventral cortical pathways for parallel visual processing. The first branch (WhereCNN) samples a global view to learn spatial attention and control eye movements. The second branch (WhatCNN) samples a local view to represent the object around the fixation. Over time, the two branches interact recurrently to build a scene representation from moving fixations. We compared this model with human brains processing the same movie and evaluated their functional alignment through a linear transformation. The WhereCNN and WhatCNN branches differentially matched the dorsal and ventral pathways of the visual cortex, respectively, primarily because of their different learning objectives. These model-based results lead us to speculate that the distinct responses and representations of the ventral and dorsal streams are shaped more by their distinct goals in visual attention and object recognition than by their specific bias or selectivity in retinal inputs. This dual-stream model takes a further step in brain-inspired computer vision, enabling parallel neural networks to actively explore and understand the visual surroundings.
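
To make the phrase "functional alignment by linear transformation" concrete, the sketch below fits a linear map from model features to brain responses and scores it by the correlation between predicted and held-out responses. It is only an illustration under stated assumptions: the abstract does not specify the estimator, so ridge regression, the train/test split, the synthetic data, and all function names (`fit_ridge`, `columnwise_correlation`, `alignment_scores`) are hypothetical choices, not the authors' pipeline.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    gram = X.T @ X + alpha * np.eye(n_features)
    return np.linalg.solve(gram, X.T @ Y)


def columnwise_correlation(A, B):
    """Pearson correlation between matching columns of A and B."""
    A = A - A.mean(axis=0)
    B = B - B.mean(axis=0)
    denom = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=0)
    return (A * B).sum(axis=0) / denom


def alignment_scores(features, responses, n_train, alpha=1.0):
    """Fit on the first n_train time points, score on the remaining ones.

    features:  (time, n_features) activations of one model branch (assumed)
    responses: (time, n_voxels) measured brain responses (assumed)
    Returns one correlation per voxel on the held-out time points.
    """
    X_train, X_test = features[:n_train], features[n_train:]
    Y_train, Y_test = responses[:n_train], responses[n_train:]

    # Center with training statistics only, so no test information leaks in.
    x_mean, y_mean = X_train.mean(axis=0), Y_train.mean(axis=0)
    W = fit_ridge(X_train - x_mean, Y_train - y_mean, alpha=alpha)

    Y_pred = (X_test - x_mean) @ W + y_mean
    return columnwise_correlation(Y_pred, Y_test)


# Synthetic stand-in data: responses are a noisy linear function of features.
rng = np.random.default_rng(0)
n_time, n_features, n_voxels = 400, 20, 5
features = rng.standard_normal((n_time, n_features))
true_map = rng.standard_normal((n_features, n_voxels))
responses = features @ true_map + 0.1 * rng.standard_normal((n_time, n_voxels))

scores = alignment_scores(features, responses, n_train=300)
print(scores.round(3))
```

In a comparison like the one described above, such scores would be computed separately for each branch's features, and the per-voxel scores of the two branches could then be contrasted across dorsal and ventral regions.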
