Decoding of seen visual contents with non-invasive brain recordings has important scientific and practical values. Efforts have been made to recover the seen images from brain signals. However, most existing approaches cannot faithfully reflect the visual contents due to insufficient image quality or semantic mismatches. Compared with reconstructing pixel-level visual images, speaking is a more efficient and effective way to explain visual information. Here we introduce a non-invasive neural decoder, termed as MindGPT, which interprets perceived visual stimuli into natural languages from fMRI signals. Specifically, our model builds upon a visually guided neural encoder with a cross-attention mechanism, which permits us to guide latent neural representations towards a desired language semantic direction in an end-to-end manner by the collaborative use of the large language model GPT. By doing so, we found that the neural representations of the MindGPT are explainable, which can be used to evaluate the contributions of visual properties to language semantics. Our experiments show that the generated word sequences truthfully represented the visual information (with essential details) conveyed in the seen stimuli. The results also suggested that with respect to language decoding tasks, the higher visual cortex (HVC) is more semantically informative than the lower visual cortex (LVC), and using only the HVC can recover most of the semantic information. The code of the MindGPT model will be publicly available at https://github.com/JxuanC/MindGPT.
翻译:利用非侵入性脑电记录解码视觉感知内容具有重要的科学和实用价值。目前,研究者已尝试从脑电信号中重建所看到的图像。然而,现有方法大多因图像质量不足或语义匹配不准确,无法忠实地反映视觉内容。相较于逐像素重建视觉图像,语言描述是一种更高效且有效的视觉信息解释方式。本文提出一种名为MindGPT的非侵入式神经解码器,可从fMRI信号中将感知到的视觉刺激解读为自然语言。具体而言,该模型基于视觉引导的神经编码器与交叉注意力机制,通过联合使用大型语言模型GPT,以端到端方式引导潜在神经表征向期望的语言语义方向演化。研究发现,MindGPT的神经表征具有可解释性,可用于评估视觉属性对语言语义的贡献。实验表明,生成的词汇序列能真实反映视觉刺激所传达的视觉信息(包含关键细节)。结果还显示,在语言解码任务中,高级视觉皮层(HVC)比低级视觉皮层(LVC)包含更丰富的语义信息,且仅使用高级视觉皮层即可恢复大部分语义内容。MindGPT模型代码将在https://github.com/JxuanC/MindGPT 公开。