Reconstructing visual stimuli from brain recordings is a meaningful and challenging task. In particular, precise and controllable image reconstruction is of great significance for advancing the development and application of brain-computer interfaces. Despite advances in techniques for reconstructing complex images, it remains difficult to align both the semantics (concepts and objects) and the structure (position, orientation, and size) of a reconstruction with the image stimulus. To address this issue, we propose a two-stage image reconstruction model called MindDiffuser. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are fed into Stable Diffusion, which yields a preliminary image that contains semantic information. In Stage 2, we use the CLIP visual feature decoded from fMRI as supervisory information and iteratively adjust the two feature vectors decoded in Stage 1 through backpropagation to align the structural information. Both qualitative and quantitative analyses demonstrate that our model surpasses the current state-of-the-art models on the Natural Scenes Dataset (NSD). Subsequent experiments corroborate the neurobiological plausibility of the model, as evidenced by the interpretability of the multimodal features employed, which align with the corresponding brain responses.
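The Stage 2 idea, adjusting two decoded feature vectors so that a feature of the generated output matches a target decoded from fMRI, can be illustrated with a deliberately small sketch. Everything here is an assumption made for illustration: the linear maps `A` and `B` stand in for the far more complex path through Stable Diffusion and the CLIP image encoder, the gradients are written out by hand rather than obtained by backpropagation through a network, and the names `refine_latents`, `z`, `c`, and `target` are hypothetical. It is not the authors' implementation.

```python
import numpy as np


def refine_latents(z, c, A, B, target, lr, steps=500):
    """Toy Stage 2 loop (illustrative only).

    z, c   : the two feature vectors decoded in Stage 1 (stand-ins for the
             VQ-VAE latent and the CLIP text embedding).
    A, B   : hypothetical linear maps replacing generator + feature extractor.
    target : stand-in for the CLIP visual feature decoded from fMRI.
    """
    z, c = z.copy(), c.copy()  # leave the caller's arrays untouched
    losses = []
    for _ in range(steps):
        residual = A @ z + B @ c - target       # feature mismatch
        losses.append(float(residual @ residual))  # squared-error loss
        # Analytic gradients of the squared error with respect to z and c.
        z -= lr * 2.0 * (A.T @ residual)
        c -= lr * 2.0 * (B.T @ residual)
    return z, c, losses


rng = np.random.default_rng(0)
A = rng.normal(size=(8, 4))
B = rng.normal(size=(8, 3))

# Build a reachable target, then start from perturbed "Stage 1" estimates.
z_true, c_true = rng.normal(size=4), rng.normal(size=3)
target = A @ z_true + B @ c_true
z0 = z_true + 0.5 * rng.normal(size=4)
c0 = c_true + 0.5 * rng.normal(size=3)

# Step size 1/L, where L is the Lipschitz constant of the gradient; this
# guarantees that the loss does not increase from one step to the next.
lr = 1.0 / (2.0 * np.linalg.norm(np.hstack([A, B]), 2) ** 2)

z, c, losses = refine_latents(z0, c0, A, B, target, lr)
print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
```

In this toy setting both vectors are updated jointly from a single loss, which mirrors the description above of adjusting the two Stage 1 features under one supervisory signal. In the real model the maps are nonlinear, so the gradients would come from automatic differentiation instead.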