Visual motion processing is essential for organisms to perceive and interact with dynamic environments. Despite extensive research in cognitive neuroscience, image-computable models that can extract informative motion flow from natural scenes in a manner consistent with human visual processing have yet to be established. Meanwhile, recent advances in computer vision (CV), propelled by deep learning, have led to significant progress in optical flow estimation, a task closely related to motion perception. Here we propose an image-computable model of human motion perception that bridges the gap between models of human vision and CV models. Specifically, we introduce a novel two-stage approach that combines trainable motion energy sensing with a recurrent self-attention network for adaptive motion integration and segregation. This architecture aims to capture the computations in V1-MT, the core structure for motion perception in the biological visual system. In silico neurophysiology reveals that the responses of our model's units resemble mammalian neural recordings in motion pooling and speed tuning. The proposed model also replicates human responses to a range of stimuli examined in past psychophysical studies. Experimental results on the Sintel benchmark demonstrate that our model's predictions match human responses more closely than they match the ground truth, whereas the CV models show the opposite pattern. A further partial correlation analysis indicates that our model outperforms several state-of-the-art CV models in explaining the human responses that deviate from the ground truth. Our study provides a computational architecture consistent with human visual motion processing, although the physiological correspondence may not be exact.
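The partial correlation analysis mentioned above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names, the toy one-dimensional data, and the use of a single control variable are all assumptions made for illustration. The idea is to correlate model predictions with human responses after the component explained by the ground truth has been removed from both.

```python
import math

def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def partial_corr(model, human, ground_truth):
    """Correlation between model and human, controlling for ground truth.

    Uses the standard first-order partial correlation formula for a
    single control variable.
    """
    r_mh = pearson(model, human)
    r_mg = pearson(model, ground_truth)
    r_hg = pearson(human, ground_truth)
    return (r_mh - r_mg * r_hg) / math.sqrt((1 - r_mg ** 2) * (1 - r_hg ** 2))

# Hypothetical toy data (not from the paper): one flow component per trial.
ground_truth = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
human_bias = [1.0, -1.0, 0.0, 0.0, -1.0, 1.0]   # human deviation from ground truth
other_error = [1.0, 1.0, -2.0, -2.0, 1.0, 1.0]  # error unrelated to the human bias

human = [g + b for g, b in zip(ground_truth, human_bias)]
# model_a shares part of the human bias; model_b errs in an unrelated way.
model_a = [g + 0.5 * b for g, b in zip(ground_truth, human_bias)]
model_b = [g + 0.5 * e for g, e in zip(ground_truth, other_error)]

print(partial_corr(model_a, human, ground_truth))
print(partial_corr(model_b, human, ground_truth))
```

In this toy setup both hypothetical models correlate strongly with the ground truth, but only `model_a` has a partial correlation near 1 with the human responses, while `model_b` is near 0. This is the kind of distinction a partial correlation is meant to expose: whether a model explains the part of human behavior that departs from the physically correct answer.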