Decoding natural visual scenes from brain activity has flourished, with extensive research on single-subject tasks but far less on cross-subject tasks. Reconstructing high-quality images in the cross-subject setting is challenging because of profound individual differences between subjects and the scarcity of annotated data. In this work, we propose MindTuner for cross-subject visual decoding, which achieves high-quality, semantically rich reconstructions using only 1 hour of fMRI training data by exploiting the visual fingerprint phenomenon in the human visual system and a novel fMRI-to-text alignment paradigm. First, we pre-train a multi-subject model on 7 subjects and fine-tune it on new subjects with scarce data, using LoRAs with Skip-LoRAs to learn the visual fingerprint. Then, we use the image modality as an intermediate pivot to achieve fMRI-to-text alignment, which yields strong fMRI-to-text retrieval performance and corrects fMRI-to-image reconstruction with fine-tuned semantics. Both qualitative and quantitative analyses demonstrate that MindTuner surpasses state-of-the-art cross-subject visual decoding models on the Natural Scenes Dataset (NSD), whether trained on 1 hour or 40 hours of data.
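The abstract states that LoRAs are used when fine-tuning the pre-trained multi-subject model on a new subject. As a rough illustration of why this suits scarce data, the following is a minimal sketch of the standard LoRA update on a single linear layer, written with NumPy. It is not MindTuner's implementation: the layer sizes, `rank`, `alpha`, and the function name `lora_forward` are hypothetical, and Skip-LoRA is not attempted because the abstract does not describe its design.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
d_in, d_out, rank, alpha = 64, 32, 4, 8.0

# Frozen weight standing in for one layer of a pre-trained model.
W = rng.standard_normal((d_out, d_in))

# Trainable low-rank factors. B starts at zero, so the adapted layer
# initially behaves exactly like the frozen layer.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))


def lora_forward(x, W, A, B, alpha, rank):
    """Compute x @ (W + (alpha / rank) * B @ A).T without forming the full update."""
    base = x @ W.T
    update = (x @ A.T) @ B.T
    return base + (alpha / rank) * update


x = rng.standard_normal((5, d_in))
y_init = lora_forward(x, W, A, B, alpha, rank)
same_at_init = np.allclose(y_init, x @ W.T)

# Stand-in for a fine-tuning step: only B (and A) would change, W stays frozen.
B_tuned = rng.standard_normal((d_out, rank)) * 0.1
y_tuned = lora_forward(x, W, A, B_tuned, alpha, rank)

# Number of trainable parameters versus the size of the frozen weight.
trainable = A.size + B.size
full = W.size
```

In this sketch only `A` and `B` would be trained, so the number of parameters fitted per new subject (`trainable`) is much smaller than the frozen weight (`full`), which is the usual motivation for low-rank adaptation when little training data is available.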