Previous studies have shown that brain activation data from subjects viewing images can be mapped onto the feature representation space not only of vision models (modality-specific decoding) but also of language models (cross-modal decoding). In this work, we introduce a new large-scale fMRI dataset (~8,500 trials per subject) of people viewing both images and text descriptions of those images. This dataset enables the development of modality-agnostic decoders: a single decoder that can predict which stimulus a subject is seeing, irrespective of the modality (image or text) in which the stimulus is presented. We train and evaluate such decoders to map brain signals onto stimulus representations from a wide range of publicly available vision, language, and multimodal (vision+language) models. Our findings reveal that (1) modality-agnostic decoders perform as well as (and sometimes even better than) modality-specific decoders; (2) modality-agnostic decoders mapping brain data onto representations from unimodal models perform as well as decoders relying on multimodal representations; and (3) while language regions and low-level visual (occipital) regions are best at decoding text and image stimuli, respectively, high-level visual (temporal) regions perform well on both stimulus types.
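To make the idea of a modality-agnostic decoder concrete, the following is a minimal sketch on synthetic data, not the method used in this work. It assumes a linear ridge-regression decoder and a cosine-similarity identification score; both are illustrative choices, and all array sizes, noise levels, and names (`fit_ridge`, `identification_accuracy`, `brain_image`, `brain_text`) are hypothetical. A single weight matrix is fit on pooled image and text trials and then evaluated on each modality separately.

```python
import numpy as np

def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression: W = (X^T X + alpha * I)^-1 X^T Y."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(n_features), X.T @ Y)

def identification_accuracy(pred, targets):
    """Fraction of trials whose prediction is most cosine-similar to its own target."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = p @ t.T  # (n_trials, n_candidates) cosine similarities
    return float(np.mean(np.argmax(sims, axis=1) == np.arange(len(pred))))

rng = np.random.default_rng(0)
n_stimuli, n_voxels, n_dims = 200, 50, 16  # hypothetical sizes

# Synthetic stand-in for model features: one vector per stimulus.
features = rng.standard_normal((n_stimuli, n_dims))

# Assumed toy encoding: each modality evokes a related but distinct voxel pattern.
enc_image = rng.standard_normal((n_dims, n_voxels))
enc_text = enc_image + 0.5 * rng.standard_normal((n_dims, n_voxels))
brain_image = features @ enc_image + 0.5 * rng.standard_normal((n_stimuli, n_voxels))
brain_text = features @ enc_text + 0.5 * rng.standard_normal((n_stimuli, n_voxels))

train, test = np.arange(150), np.arange(150, 200)

# Modality-agnostic training: pool image and text trials into one training set.
X_train = np.vstack([brain_image[train], brain_text[train]])
Y_train = np.vstack([features[train], features[train]])
W = fit_ridge(X_train, Y_train, alpha=10.0)

# Evaluate the same decoder on held-out trials from each modality.
acc_image = identification_accuracy(brain_image[test] @ W, features[test])
acc_text = identification_accuracy(brain_text[test] @ W, features[test])
print(f"image trials: {acc_image:.2f}, text trials: {acc_text:.2f}")
```

In this sketch, a modality-specific decoder would be the same fit restricted to `brain_image[train]` or `brain_text[train]` alone, which allows the two setups to be compared on identical test trials.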