Feedforward artificial neural networks (ANNs) trained on static images remain the dominant models of the primate ventral visual stream, yet they are intrinsically limited to static computations. The primate world is dynamic, and the macaque ventral visual pathway, specifically the inferior temporal (IT) cortex, not only supports object recognition but also encodes object motion velocity during naturalistic video viewing. Do IT's temporal responses reflect nothing more than time-unfolded feedforward transformations, framewise features with shallow temporal pooling, or do they embody richer dynamic computations? We tested this by comparing macaque IT responses during naturalistic videos against static, recurrent, and video-based ANN models. Video models provided modest improvements in neural predictivity, particularly at later response stages, raising the question of what kind of dynamics they capture. To probe this, we applied a stress test: decoders trained on naturalistic videos were evaluated on "appearance-free" variants that preserve motion but remove shape and texture. IT population activity generalized across this manipulation, but all ANN classes failed. Thus, current video models capture appearance-bound dynamics rather than the appearance-invariant temporal computations expressed in IT, underscoring the need for new objectives that encode biological temporal statistics and invariances.