Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may bias results toward pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds on the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.
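The core idea the abstract sketches can be illustrated with a minimal, self-contained toy: fit a linear encoder from a shared vision-language embedding space to a voxel's responses, then treat the encoder weights as the voxel's preferred direction in that space and rank candidate text embeddings against it. This is only a conceptual sketch under stated assumptions, not the paper's implementation; the simulated embeddings, the ridge-regression encoder, and the candidate-ranking step are stand-ins (a real pipeline would use actual CLIP-style features and a captioning language model).

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_images = 16, 200

# Simulated unit-normalized image embeddings, stand-ins for features
# from a contrastive vision-language model (e.g. CLIP-style).
img_emb = rng.normal(size=(n_images, d))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)

# Hypothetical voxel that prefers one direction w_true in embedding
# space; its response is a noisy projection onto that direction.
w_true = rng.normal(size=d)
w_true /= np.linalg.norm(w_true)
voxel = img_emb @ w_true + 0.1 * rng.normal(size=n_images)

# Step 1: linear encoder (ridge regression) from embeddings to the voxel.
lam = 1.0
w_hat = np.linalg.solve(img_emb.T @ img_emb + lam * np.eye(d),
                        img_emb.T @ voxel)
w_hat /= np.linalg.norm(w_hat)

# Step 2: because the encoder weights live in the shared image-text
# space, candidate text embeddings can be ranked by cosine similarity
# to w_hat; the top match would seed a language-model caption.
cand_emb = rng.normal(size=(50, d))
cand_emb /= np.linalg.norm(cand_emb, axis=1, keepdims=True)
best = int(np.argmax(cand_emb @ w_hat))

# The recovered direction should align closely with the true one.
alignment = float(w_hat @ w_true)
print(f"alignment with true preferred direction: {alignment:.3f}")
```

The design point being illustrated: once the voxel encoder is expressed in the contrastive embedding space, "what drives this voxel" and "what does this text mean" become comparable by a single dot product, which is what makes voxel-wise captioning possible without hand-selected stimuli.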