Prior work has offered evidence for functional localization in the brain: different anatomical regions preferentially activate for certain types of visual input. For example, the fusiform face area preferentially activates for visual stimuli that include a face. However, the spectrum of visual semantics is extensive, and only a few semantically tuned patches of cortex have so far been identified in the human brain. Using a multimodal (natural language and image) neural network architecture (CLIP), we train a highly accurate contrastive model that maps brain responses during naturalistic image viewing to CLIP embeddings. We then use a novel adaptation of the DBSCAN clustering algorithm to cluster the parameters of these participant-specific contrastive models. This reveals what we call Shared Decodable Concepts (SDCs): clusters in CLIP space that are decodable from common sets of voxels across multiple participants. Examining the images most and least associated with each SDC cluster gives us additional insight into the semantic properties of each SDC. We find SDCs for previously reported visual features (e.g., orientation tuning in early visual cortex) as well as for visuo-semantic concepts such as faces, places, and bodies. In cases where our method finds multiple clusters for a visuo-semantic concept, the least-associated images allow us to dissociate between confounding factors. For example, we discovered two clusters of food images, one driven by color and the other by shape. We also uncover previously unreported selectivities, such as regions of the extrastriate body area (EBA) tuned for legs and hands, and sensitivity to numerosity in the right intraparietal sulcus. Thus, our contrastive-learning methodology better characterizes new and existing visuo-semantic representations in the brain by leveraging multimodal neural network representations and a novel adaptation of clustering algorithms.
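To make the clustering step concrete, here is a minimal sketch of the idea of clustering participant-specific decoder parameters with DBSCAN. All names and data are hypothetical: each row stands in for one trained contrastive model's parameter vector in CLIP space (the paper's actual adaptation of DBSCAN and its hyperparameters are not reproduced here), and we simulate two shared concept directions plus a few outlier vectors.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)

# Hypothetical stand-in data: each row is one participant-specific
# decoder's parameter vector in 512-d CLIP space. We simulate two
# shared concept directions (e.g., "faces", "places") with small
# per-participant noise, plus four unrelated outlier vectors.
concept_a = rng.normal(0, 1, 512)
concept_b = rng.normal(0, 1, 512)
params = np.vstack(
    [concept_a + 0.05 * rng.normal(0, 1, 512) for _ in range(8)]
    + [concept_b + 0.05 * rng.normal(0, 1, 512) for _ in range(8)]
    + [rng.normal(0, 1, 512) for _ in range(4)]  # outliers
)

# Cosine distance groups directionally similar parameter vectors;
# eps and min_samples are illustrative values, not the paper's settings.
labels = DBSCAN(eps=0.3, min_samples=4, metric="cosine").fit_predict(params)
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(n_clusters)  # → 2 (outliers are labeled -1, i.e., noise)
```

DBSCAN is a natural fit here because the number of shared concepts is not known in advance and idiosyncratic, participant-specific parameter vectors can be left unassigned as noise rather than forced into a cluster.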