Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but far less on cross-subject tasks. Reconstructing high-quality images in the cross-subject setting is challenging due to profound individual differences between subjects and the scarcity of annotated data. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality and semantically rich reconstructions using only 1 hour of fMRI training data, benefiting from the phenomenon of visual fingerprints in the human visual system and a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model on 7 subjects and fine-tune it with scarce data from new subjects, using LoRAs together with Skip-LoRAs to learn each subject's visual fingerprint. We then use the image modality as an intermediate pivot to achieve fMRI-to-text alignment, which yields impressive fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. Both qualitative and quantitative results demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether trained on 1 hour or 40 hours of data.
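To make the fine-tuning step concrete, the following is a minimal PyTorch sketch of a LoRA adapter combined with a Skip-LoRA path. The class names, ranks, and the exact wiring of the skip path (raw fMRI input routed straight to a deeper layer) are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch: LoRA + Skip-LoRA fine-tuning for a new subject.
# Shapes, ranks, and the Skip-LoRA wiring are assumptions for illustration.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep shared multi-subject weights frozen
        self.A = nn.Linear(base.in_features, rank, bias=False)
        self.B = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)        # update starts at zero (identity behavior)
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

class SkipLoRA(nn.Module):
    """Low-rank path from the raw fMRI input directly to a deeper layer's
    output, intended to capture the subject-specific visual fingerprint
    (assumed wiring)."""
    def __init__(self, in_dim: int, out_dim: int, rank: int = 8):
        super().__init__()
        self.A = nn.Linear(in_dim, rank, bias=False)
        self.B = nn.Linear(rank, out_dim, bias=False)
        nn.init.zeros_(self.B.weight)

    def forward(self, x_input):
        return self.B(self.A(x_input))
```

In this sketch, only the low-rank matrices are trained on the new subject's 1 hour of data, while the pre-trained multi-subject backbone stays frozen; a deep feature would be computed as, e.g., `lora_layer(h) + skip(x)`, adding the fingerprint correction on top of the shared representation.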
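The pivot-based alignment can likewise be sketched with a standard contrastive objective: fMRI embeddings are pulled toward image embeddings (the pivot), and since CLIP-style image and text embeddings share one space, the aligned fMRI embeddings then support text retrieval directly. The loss and retrieval helpers below assume precomputed CLIP-style embeddings and are illustrative, not the paper's actual training recipe.

```python
# Hypothetical sketch: fMRI-to-text alignment via the image pivot.
# Assumes precomputed CLIP-style image/text embeddings; not the authors' code.
import torch
import torch.nn.functional as F

def info_nce(fmri_emb, image_emb, temperature: float = 0.07):
    """Symmetric contrastive loss aligning fMRI features to pivot image embeddings."""
    fmri_emb = F.normalize(fmri_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = fmri_emb @ image_emb.t() / temperature
    targets = torch.arange(len(logits), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

@torch.no_grad()
def fmri_to_text_retrieval(fmri_emb, text_emb, k: int = 5):
    """Top-k caption retrieval: after pivot alignment, fMRI embeddings live in
    the shared space, so cosine similarity to text embeddings is meaningful."""
    sims = F.normalize(fmri_emb, dim=-1) @ F.normalize(text_emb, dim=-1).t()
    return sims.topk(k, dim=-1).indices
```

Because the fMRI encoder never sees text during this alignment, the image modality acts purely as a bridge; the retrieved text can then serve as the fine-tuned semantics that correct the fMRI-to-image reconstruction.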