Vision CNNs trained to estimate spatial latents learned similar ventral-stream-aligned representations

Studies of the functional role of the primate ventral visual stream have traditionally focused on object categorization, often ignoring, despite much prior evidence, its role in estimating "spatial" latents such as object position and pose. Most leading ventral stream models are derived by optimizing networks for object categorization, which seems to imply that the ventral stream is also derived under such an objective. Here, we explore an alternative hypothesis: Might the ventral stream be optimized for estimating spatial latents? And a closely related question: How different, if at all, are representations learned from spatial latent estimation compared to those learned from categorization? To address these questions, we leveraged synthetic image datasets generated by a 3D graphics engine and trained convolutional neural networks (CNNs) to estimate different combinations of spatial and category latents. We found that models trained to estimate just a few spatial latents achieve neural alignment scores comparable to those of models trained on hundreds of categories, and that a model's spatial latent performance strongly correlates with its neural alignment. Spatial latent and category-trained models have very similar, but not identical, internal representations, especially in their early and middle layers. We provide evidence that this convergence is partly driven by non-target latent variability in the training data, which facilitates the implicit learning of representations of those non-target latents. Taken together, these results suggest that many training objectives, such as spatial latent estimation, can lead to similar models that are neurally aligned with the ventral stream. Thus, one should not assume that the ventral stream is optimized for object categorization only. As a field, we need to continue to sharpen our measures for comparing models to brains to better understand the functional roles of the ventral stream.
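
The claim that spatial-latent and category-trained models have similar internal representations presupposes some way of comparing the activations of two networks on the same images. The abstract does not name the measure used, so the following is only an illustrative sketch of one common choice, linear centered kernel alignment (CKA). The function name `linear_cka` and the synthetic activation matrices are assumptions made for this example and are not taken from the work itself.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two activation matrices.

    Both inputs have shape (n_stimuli, n_features); the feature counts may
    differ, but rows must correspond to the same stimuli in the same order.
    """
    # Center each feature across stimuli.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    # Squared Frobenius norm of the cross-covariance, normalized by the
    # Frobenius norms of the two self-covariances.
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Synthetic stand-ins for layer activations (hypothetical data, not results).
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(200, 32))

# A rotated and rescaled copy of acts_a: same representation geometry.
rotation, _ = np.linalg.qr(rng.normal(size=(32, 32)))
acts_b = 3.0 * acts_a @ rotation

# Unrelated activations with a different feature count.
acts_c = rng.normal(size=(200, 16))

cka_same = linear_cka(acts_a, acts_b)
cka_unrelated = linear_cka(acts_a, acts_c)
```

Linear CKA is invariant to orthogonal rotations and isotropic rescaling of the features, so `cka_same` should sit at 1 up to floating-point error, while `cka_unrelated` should be much lower. In practice the two matrices would hold the responses of corresponding layers from a spatial-latent-trained and a category-trained model to a shared image set.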
