Understanding the functional organization of higher visual cortex is a central focus in neuroscience. Past studies have primarily mapped the visual and semantic selectivity of neural populations using hand-selected stimuli, which may bias results towards pre-existing hypotheses of visual cortex functionality. Moving beyond conventional approaches, we introduce a data-driven method that generates natural language descriptions for images predicted to maximally activate individual voxels of interest. Our method -- Semantic Captioning Using Brain Alignments ("BrainSCUBA") -- builds on the rich embedding space learned by a contrastive vision-language model and utilizes a pre-trained large language model to generate interpretable captions. We validate our method through fine-grained voxel-level captioning across higher-order visual regions. We further perform text-conditioned image synthesis with the captions, and show that our images are semantically coherent and yield high predicted activations. Finally, to demonstrate how our method enables scientific discovery, we perform exploratory investigations on the distribution of "person" representations in the brain, and discover fine-grained semantic selectivity in body-selective areas. Unlike earlier studies that decode text, our method derives voxel-wise captions of semantic selectivity. Our results show that BrainSCUBA is a promising means for understanding functional preferences in the brain, and provides motivation for further hypothesis-driven investigation of visual cortex.
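The core intuition -- scoring a voxel's selectivity against captions in a shared vision-language embedding space -- can be sketched with a toy example. This is a minimal illustration only: the vectors below are random stand-ins for embeddings from a contrastive vision-language model, and `closest_caption` is a hypothetical helper, not the paper's actual pipeline (which additionally uses a pre-trained language model to generate, rather than retrieve, captions).

```python
import numpy as np

def closest_caption(voxel_weight, caption_embeddings, captions):
    """Return the candidate caption whose unit-normalized embedding has the
    highest cosine similarity with the voxel's encoder weight vector."""
    w = voxel_weight / np.linalg.norm(voxel_weight)
    E = caption_embeddings / np.linalg.norm(
        caption_embeddings, axis=1, keepdims=True
    )
    scores = E @ w  # cosine similarities, one per candidate caption
    return captions[int(np.argmax(scores))], scores

# Toy stand-ins for CLIP-style embedding vectors (assumption: in the real
# method these live in a contrastive vision-language model's space).
rng = np.random.default_rng(0)
captions = ["a group of people talking", "a plate of food", "a city street"]
caption_embeddings = rng.normal(size=(3, 8))

# A voxel whose linear encoder weights point near the "people" caption.
voxel_weight = caption_embeddings[0] + 0.01 * rng.normal(size=8)

best, scores = closest_caption(voxel_weight, caption_embeddings, captions)
print(best)
```

Retrieval over fixed candidates is the simplest way to see why a shared embedding space makes voxel-level semantic readout possible; generating free-form captions from the same space is what distinguishes the full method.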