Reconstructing human vision from brain activity is an appealing task that helps us understand cognitive processes. Although recent research has had great success reconstructing static images from non-invasive brain recordings, work on recovering continuous visual experiences as videos remains limited. In this work, we propose Mind-Video, which progressively learns spatiotemporal information from continuous fMRI data of the cerebral cortex through masked brain modeling, multimodal contrastive learning with spatiotemporal attention, and co-training with an augmented Stable Diffusion model that incorporates network temporal inflation. We show that Mind-Video can reconstruct high-quality videos of arbitrary frame rates using adversarial guidance. The recovered videos were evaluated with a range of semantic and pixel-level metrics. We achieved an average accuracy of 85% on semantic classification tasks and a structural similarity index (SSIM) of 0.19, outperforming the previous state of the art by 45%. We also show that our model is biologically plausible and interpretable, reflecting established physiological processes.
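For readers unfamiliar with the pixel-level metric reported above, the following is a minimal sketch of SSIM, not the evaluation code used for Mind-Video. It makes two simplifying assumptions: the statistics are computed once over each whole frame rather than over local sliding windows (as the standard SSIM does before averaging), and frames are grayscale arrays scaled to [0, 1]. The function names `global_ssim` and `mean_frame_ssim` are hypothetical, and the abstract does not state how per-frame scores were aggregated.

```python
import numpy as np


def global_ssim(x, y, data_range=1.0, k1=0.01, k2=0.03):
    """Simplified SSIM using whole-image statistics (no sliding window)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    if x.shape != y.shape:
        raise ValueError("frames must have the same shape")

    # Stabilizing constants from the usual SSIM definition.
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def mean_frame_ssim(video_a, video_b, data_range=1.0):
    """Average the simplified SSIM over corresponding frames (assumed aggregation)."""
    scores = [global_ssim(a, b, data_range) for a, b in zip(video_a, video_b)]
    return float(np.mean(scores))
```

A score of 1 indicates identical frames, and lower values indicate less structural agreement, so the reported 0.19 should be read on that scale.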