Humans and animals have a rich and flexible understanding of the physical world, which enables them to infer the underlying dynamical trajectories of objects and events, anticipate plausible future states, and use these inferences to plan actions and foresee their consequences. However, the neural mechanisms underlying these computations are unclear. We combine a goal-driven modeling approach with dense neurophysiological data and high-throughput human behavioral readouts to address this question directly. Specifically, we construct and evaluate several classes of sensory-cognitive networks that predict the future state of rich, ethologically relevant environments, ranging from self-supervised end-to-end models with pixel-wise or object-centric objectives to models that predict the future in the latent space of pretrained foundation models, whether purely static image-based or dynamic video-based. We find strong differentiation across these model classes in their ability to predict neural and behavioral data both within and across diverse environments. In particular, neural responses are currently best predicted by models trained to predict the future state of their environment in the latent space of pretrained foundation models that are optimized for dynamic scenes in a self-supervised manner. Notably, models that predict the future in the latent space of video foundation models optimized to support a diverse range of sensorimotor tasks reasonably match both human behavioral error patterns and neural dynamics across all environmental scenarios that we were able to test. Overall, these findings suggest that the neural mechanisms and behaviors of primate mental simulation are thus far most consistent with being optimized to predict the future on dynamic, reusable visual representations that are useful for embodied AI more generally.
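To make the idea of future prediction in a latent space concrete, the following is a minimal sketch and not the authors' implementation. It assumes a hypothetical frozen encoder (a fixed random linear projection standing in for a pretrained video foundation model), a toy environment whose hidden state rotates over time, and a linear dynamics module fit by ridge regression. All names (`W_ENC`, `make_video`, `fit_dynamics`, `rollout`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
FRAME_DIM, STATE_DIM, LATENT_DIM = 64, 4, 8

# ASSUMPTION: stand-in for a frozen, pretrained video encoder. A real
# foundation model would be a deep network; here it is a fixed random
# linear projection that is never updated.
W_ENC = rng.standard_normal((FRAME_DIM, LATENT_DIM)) / np.sqrt(FRAME_DIM)


def rotation(theta):
    """2x2 rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])


# ASSUMPTION: toy world. A hidden 4-D state rotates at each time step and is
# rendered to a flattened "frame" by a fixed linear map.
R = np.zeros((STATE_DIM, STATE_DIM))
R[:2, :2] = rotation(0.3)
R[2:, 2:] = rotation(0.7)
RENDER = rng.standard_normal((STATE_DIM, FRAME_DIM))


def make_video(n_frames):
    """Generate one toy video of shape (n_frames, FRAME_DIM)."""
    state = rng.standard_normal(STATE_DIM)
    frames = []
    for _ in range(n_frames):
        frames.append(state @ RENDER)
        state = state @ R
    return np.stack(frames)


def encode(frames):
    """Frozen encoder: frames (T, FRAME_DIM) -> latents (T, LATENT_DIM)."""
    return frames @ W_ENC


def fit_dynamics(videos, ridge=1e-6):
    """Fit a linear map from the latent at time t to the latent at t + 1."""
    z_now = np.concatenate([encode(v)[:-1] for v in videos])
    z_next = np.concatenate([encode(v)[1:] for v in videos])
    gram = z_now.T @ z_now + ridge * np.eye(LATENT_DIM)
    return np.linalg.solve(gram, z_now.T @ z_next)


def rollout(z0, dynamics, n_steps):
    """Predict n_steps future latents by repeatedly applying the dynamics."""
    z, out = z0, []
    for _ in range(n_steps):
        z = z @ dynamics
        out.append(z)
    return np.stack(out)


train_videos = [make_video(30) for _ in range(20)]
dynamics = fit_dynamics(train_videos)

# Evaluate on a held-out video against a baseline that copies the first latent.
test_latents = encode(make_video(30))
predicted = rollout(test_latents[0], dynamics, n_steps=10)
model_err = np.mean((predicted - test_latents[1:11]) ** 2)
copy_err = np.mean((test_latents[0] - test_latents[1:11]) ** 2)
print(f"rollout MSE: {model_err:.2e}, copy-first-latent MSE: {copy_err:.2e}")
```

Only the dynamics module is fit here; the encoder stays fixed, which mirrors the separation between a reusable visual representation and a predictor trained on top of it. In this toy setting the latent dynamics are exactly linear, so a linear predictor suffices; real scenes would presumably call for a nonlinear, recurrent dynamics model.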