Reconstructing perceived natural images from fMRI signals and decoding their categories are challenging tasks of great scientific significance. Because paired samples are scarce, most existing methods fail to generate semantically recognizable reconstructions and generalize poorly to novel classes. In this work, we propose, for the first time, a task-agnostic brain decoding model that unifies visual stimulus classification and reconstruction in a semantic space. We call it BrainCLIP; it leverages CLIP's cross-modal generalization ability to bridge the modality gap between brain activity, images, and text. Specifically, BrainCLIP is a VAE-based architecture that transforms fMRI patterns into the CLIP embedding space by combining visual and textual supervision. Previous works have rarely used multi-modal supervision for visual stimulus decoding. Our experiments demonstrate that adding textual supervision significantly boosts decoding performance compared with image supervision alone. BrainCLIP can be applied to multiple scenarios, such as fMRI-to-image generation, fMRI-image matching, and fMRI-text matching. Compared with BraVL, a recently proposed multi-modal method for fMRI-based brain decoding, BrainCLIP achieves significantly better performance on the novel-class classification task. BrainCLIP also establishes a new state of the art for fMRI-based natural image reconstruction in terms of high-level image features.
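To make the matching scenarios concrete, the following is a minimal sketch of how an fMRI-derived embedding could be matched against candidate embeddings in a shared space. It is not the BrainCLIP implementation. The abstract does not state the similarity measure, so cosine similarity is an assumption here. The function names (`l2_normalize`, `match_to_candidates`), the 512-dimensional embedding size, and the random vectors standing in for CLIP class embeddings and encoder outputs are all hypothetical.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so that dot products equal cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def match_to_candidates(brain_emb, candidate_emb):
    """Pick, for each fMRI-derived embedding, the most similar candidate.

    brain_emb:     (n_trials, dim) embeddings predicted from fMRI (hypothetical).
    candidate_emb: (n_candidates, dim) image or text embeddings in the same space.
    Returns the best candidate index per trial and the full similarity matrix.
    """
    sims = l2_normalize(brain_emb) @ l2_normalize(candidate_emb).T
    return sims.argmax(axis=1), sims


# Synthetic stand-ins: random vectors play the role of class embeddings,
# and noisy copies play the role of embeddings decoded from fMRI.
rng = np.random.default_rng(0)
dim = 512  # assumed embedding size, chosen only for illustration
class_emb = rng.normal(size=(5, dim))
true_labels = np.array([3, 0, 4])
brain_emb = class_emb[true_labels] + 0.1 * rng.normal(size=(3, dim))

pred, sims = match_to_candidates(brain_emb, class_emb)
print(pred)
```

Because the candidates are only compared by similarity, the same routine would serve fMRI-image matching and fMRI-text matching alike, and classes unseen during training can be added simply by appending their embeddings to the candidate set.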