Decoding visual-semantic information from brain signals, such as functional MRI (fMRI), across different subjects poses significant challenges, including low signal-to-noise ratio, limited data availability, and cross-subject variability. Recent advancements in large language models (LLMs) have shown remarkable effectiveness in processing multimodal information. In this study, we introduce an LLM-based approach for reconstructing visual-semantic information from fMRI signals elicited by video stimuli. Specifically, we fine-tune an fMRI encoder equipped with adaptors to transform brain responses into latent representations aligned with the video stimuli. These representations are then mapped to the textual modality by an LLM. In particular, we integrate self-supervised domain adaptation methods to enhance the alignment between visual-semantic information and brain responses. Our proposed method achieves good results on various quantitative semantic metrics and yields reconstructions similar to the ground-truth information.
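To make the described pipeline concrete, the sketch below illustrates one plausible realization in PyTorch: an adaptor-equipped fMRI encoder producing latents, a contrastive loss aligning those latents with video-stimulus embeddings, and a simple feature-statistics term standing in for the self-supervised cross-subject domain adaptation. All module names, dimensions, and loss weights here are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the fMRI-to-text alignment pipeline (assumed names/sizes).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Lightweight residual bottleneck adaptor inserted into the fMRI encoder."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        return x + self.up(F.relu(self.down(x)))

class FMRIEncoder(nn.Module):
    """Maps voxel responses to latents intended to align with video embeddings."""
    def __init__(self, voxel_dim, latent_dim=768):
        super().__init__()
        self.proj = nn.Linear(voxel_dim, latent_dim)
        self.adapter = Adapter(latent_dim)

    def forward(self, voxels):
        return self.adapter(self.proj(voxels))

def alignment_loss(brain_latents, video_latents, temperature=0.07):
    """InfoNCE-style contrastive alignment between brain and video latents."""
    b = F.normalize(brain_latents, dim=-1)
    v = F.normalize(video_latents, dim=-1)
    logits = b @ v.t() / temperature
    targets = torch.arange(b.size(0), device=b.device)
    return F.cross_entropy(logits, targets)

def domain_adaptation_loss(source_latents, target_latents):
    """Self-supervised cross-subject alignment via feature-statistics matching
    (a simple stand-in for the paper's domain-adaptation objective)."""
    mean_gap = (source_latents.mean(0) - target_latents.mean(0)).pow(2).sum()
    std_gap = (source_latents.std(0) - target_latents.std(0)).pow(2).sum()
    return mean_gap + std_gap

# Example training step (shapes and the 0.1 weight are assumptions):
# encoder = FMRIEncoder(voxel_dim=4500)
# brain = encoder(fmri_batch)                              # (B, 768)
# loss = alignment_loss(brain, video_embeddings) \
#        + 0.1 * domain_adaptation_loss(brain, brain_from_other_subject)
# The aligned latents would then be projected into the LLM's embedding space
# and decoded into text describing the video stimuli.
```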