Reconstruction of static visual stimuli from non-invasive brain activity (fMRI) has achieved great success, owing to advanced deep learning models such as CLIP and Stable Diffusion. However, research on fMRI-to-video reconstruction remains limited, since decoding the spatiotemporal perception of continuous visual experiences is formidably challenging. We contend that the key to addressing these challenges lies in accurately decoding both the high-level semantics and the low-level perception flows that the brain perceives in response to video stimuli. To this end, we propose NeuroClips, an innovative framework for decoding high-fidelity, smooth videos from fMRI. NeuroClips uses a semantics reconstructor to reconstruct video keyframes, guiding semantic accuracy and consistency, and a perception reconstructor to capture low-level perceptual details, ensuring video smoothness. During inference, it adopts a pre-trained text-to-video (T2V) diffusion model injected with both the keyframes and the low-level perception flows for video reconstruction. Evaluated on a publicly available fMRI-video dataset, NeuroClips achieves smooth, high-fidelity video reconstruction of up to 6 s at 8 FPS, with significant improvements over state-of-the-art models on various metrics, e.g., a 128% improvement in SSIM and an 81% improvement in spatiotemporal metrics. Our project is available at https://github.com/gongzix/NeuroClips.
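The two-stream decoding described above can be sketched schematically. This is a minimal, hedged illustration of the data flow only: the linear mappings, voxel count, and feature dimensions are hypothetical placeholders (not the authors' actual reconstructors), and the pre-trained T2V diffusion step that consumes both conditioning signals is omitted.

```python
import numpy as np

# Illustrative sketch of NeuroClips' two decoding streams. All sizes and
# the linear maps below are placeholder assumptions, not the paper's models.
rng = np.random.default_rng(0)

N_VOXELS = 1000   # flattened fMRI input size (hypothetical)
EMBED_DIM = 768   # keyframe embedding size, e.g. a CLIP-like space (assumed)
N_FRAMES = 48     # 6 s at 8 FPS, matching the reported output length
FRAME_DIM = 256   # per-frame low-level feature size (hypothetical)

# Semantics reconstructor: fMRI -> keyframe embedding (high-level semantics).
W_sem = rng.standard_normal((N_VOXELS, EMBED_DIM)) * 0.01
# Perception reconstructor: fMRI -> per-frame low-level perception flow.
W_per = rng.standard_normal((N_VOXELS, N_FRAMES * FRAME_DIM)) * 0.01

def decode(fmri: np.ndarray):
    """Produce the two conditioning signals that would be injected into a
    pre-trained T2V diffusion model at inference (diffusion step omitted)."""
    keyframe_embed = fmri @ W_sem                          # (batch, EMBED_DIM)
    perception_flow = (fmri @ W_per).reshape(-1, N_FRAMES, FRAME_DIM)
    return keyframe_embed, perception_flow

fmri = rng.standard_normal((1, N_VOXELS))
kf, flow = decode(fmri)
print(kf.shape, flow.shape)  # (1, 768) (1, 48, 256)
```

The point of the sketch is the shape of the problem: one stream yields a single semantic keyframe representation per clip, while the other yields a per-frame sequence whose length (48 frames) corresponds to the 6-second, 8 FPS output reported in the abstract.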