Reconstructing visual stimuli from functional magnetic resonance imaging (fMRI) recordings is a meaningful and challenging task. Previous studies have produced reconstructions whose structure resembles that of the original images, such as the outlines and sizes in some natural images. However, these reconstructions lack explicit semantic information and are difficult to recognize. In recent years, many studies have used multi-modal pre-trained models with stronger generative capabilities to reconstruct images that are semantically similar to the originals. However, the structural information of these images, such as position and orientation, cannot be controlled. To address both issues simultaneously, we propose MindDiffuser, a two-stage image reconstruction model built on Stable Diffusion. In Stage 1, the VQ-VAE latent representations and the CLIP text embeddings decoded from fMRI are fed into the image-to-image process of Stable Diffusion, which yields a preliminary image containing both semantic and structural information. In Stage 2, we use the low-level CLIP visual features decoded from fMRI as supervisory information and iteratively adjust the two Stage 1 features through backpropagation to align the structural information. Both qualitative and quantitative analyses demonstrate that the proposed model outperforms current state-of-the-art models in reconstruction quality on the Natural Scenes Dataset (NSD). Furthermore, ablation experiments indicate that each component of the model contributes to image reconstruction.
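The Stage 2 idea, adjusting two input features by backpropagation until the features of the generated image match a decoded target, can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration: `generate` is a hypothetical linear stand-in for the Stable Diffusion image-to-image pass, `features` is a hypothetical linear stand-in for the low-level CLIP visual encoder, and the squared-error loss, learning rate, and step count are arbitrary choices, not the settings used by MindDiffuser.

```python
# Toy sketch of feature-alignment refinement (NOT the MindDiffuser implementation).
# Assumptions: linear stand-ins replace Stable Diffusion and CLIP so that the
# gradients can be written by hand in pure Python.

# Hypothetical fixed "feature extractor" weights (2 features from a 3-pixel "image").
W = [[1.0, 0.5, 0.0],
     [0.0, 1.0, -0.5]]

def generate(z, c):
    """Hypothetical generator: mixes a latent z and a condition c into an 'image'."""
    return [0.7 * zi + 0.3 * ci for zi, ci in zip(z, c)]

def features(x):
    """Hypothetical low-level feature extractor: a linear map W @ x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss(z, c, target):
    """Squared error between features of the generated image and the target."""
    f = features(generate(z, c))
    return sum((fi - ti) ** 2 for fi, ti in zip(f, target))

def refine(z, c, target, lr=0.1, steps=200):
    """Jointly update z and c by gradient descent on the feature-alignment loss."""
    z, c = list(z), list(c)  # copy so the caller's inputs are not modified
    for _ in range(steps):
        f = features(generate(z, c))
        residual = [fi - ti for fi, ti in zip(f, target)]
        # Gradient w.r.t. the generated image: dL/dx = 2 * W^T @ residual
        grad_x = [2.0 * sum(W[k][j] * residual[k] for k in range(len(W)))
                  for j in range(len(z))]
        # Chain rule through generate(): dx/dz = 0.7, dx/dc = 0.3
        z = [zi - lr * 0.7 * g for zi, g in zip(z, grad_x)]
        c = [ci - lr * 0.3 * g for ci, g in zip(c, grad_x)]
    return z, c

if __name__ == "__main__":
    z0, c0 = [0.0, 0.0, 0.0], [1.0, 0.0, 0.0]   # stand-ins for the Stage 1 features
    target = [1.0, -1.0]                        # stand-in for decoded visual features
    z1, c1 = refine(z0, c0, target)
    print("loss before:", loss(z0, c0, target))
    print("loss after: ", loss(z1, c1, target))
```

In a realistic setting the hand-written gradients would be replaced by automatic differentiation through the actual generator and feature encoder; the sketch only shows the shape of the optimization loop, in which both inputs are updated from a single feature-space loss.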