While computer vision models have made remarkable strides in static image recognition, they still fall short of human performance on tasks that require understanding complex, dynamic motion. This gap is especially pronounced in real-world scenarios where embodied agents face complex, motion-rich environments. Our approach leverages state-of-the-art video diffusion models to decouple static image representation from motion generation, enabling us to use fMRI brain activity to better understand human responses to dynamic visual stimuli. Conversely, we also demonstrate that information about the brain's representation of motion can enhance the prediction of optical flow in artificial systems. Our approach leads to four main findings: (1) visual motion, represented as fine-grained optical flow at object-level resolution, can be decoded from the brain activity of participants viewing video stimuli; (2) video encoders outperform image-based models in predicting video-driven brain activity; (3) brain-decoded motion signals enable realistic video reanimation based only on the initial frame of the video; and (4) we extend prior work to achieve full video decoding from video-driven brain activity. This framework advances our understanding of how the brain represents spatial and temporal information in dynamic visual scenes. Our findings demonstrate the potential of combining brain imaging with video diffusion models for developing more robust and biologically inspired computer vision systems. Additional decoding and encoding examples are available at https://sites.google.com/view/neural-dynamics/home.