Reconstructing natural visual scenes from neural activity is a key challenge in neuroscience and computer vision. We present SpikeVAEDiff, a novel two-stage framework that combines a Very Deep Variational Autoencoder (VDVAE) with the Versatile Diffusion model to generate high-resolution, semantically meaningful image reconstructions from neural spike data. In the first stage, the VDVAE produces low-resolution preliminary reconstructions by mapping neural spike signals to latent representations. In the second stage, regression models map neural spike signals to CLIP-Vision and CLIP-Text features, which condition Versatile Diffusion to refine the images via image-to-image generation. We evaluate our approach on the Allen Visual Coding-Neuropixels dataset and analyze the contributions of different brain regions. Our results show that the VISI region exhibits the most prominent activation and plays a key role in reconstruction quality. We present both successful and unsuccessful reconstruction examples, reflecting the challenges of decoding neural activity. Compared with fMRI-based approaches, spike data offer superior temporal and spatial resolution. We further validate the effectiveness of the VDVAE model and conduct ablation studies demonstrating that data from specific brain regions significantly enhance reconstruction performance.
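The two-stage mapping described above can be sketched as two independent regressions from spike features to decoder-side latents. The sketch below is a minimal illustration, not the paper's implementation: the ridge regressors, synthetic spike counts, and all dimensions (`n_units`, `d_vdvae`, `d_clip`) are assumptions chosen for clarity, and the VDVAE decoding and Versatile Diffusion refinement steps are only indicated in comments.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

# Hypothetical shapes -- stand-ins for the actual dataset dimensions.
n_trials, n_units = 200, 300   # spike-count feature vectors per stimulus trial
d_vdvae, d_clip = 512, 768     # assumed VDVAE latent / CLIP embedding sizes

X = rng.poisson(2.0, (n_trials, n_units)).astype(float)  # synthetic spike counts
Z_vdvae = rng.normal(size=(n_trials, d_vdvae))           # stage-1 targets (VDVAE latents)
Z_clip = rng.normal(size=(n_trials, d_clip))             # stage-2 targets (CLIP features)

# Stage 1: spikes -> VDVAE latents; pushing the predicted latent through the
# VDVAE decoder (not shown) would yield the low-resolution preliminary image.
stage1 = Ridge(alpha=1.0).fit(X, Z_vdvae)

# Stage 2: spikes -> CLIP features that condition Versatile Diffusion's
# image-to-image refinement of the stage-1 reconstruction (not shown).
stage2 = Ridge(alpha=1.0).fit(X, Z_clip)

z_hat = stage1.predict(X[:1])  # predicted VDVAE latent for one trial
c_hat = stage2.predict(X[:1])  # predicted CLIP conditioning vector
```

In this framing, the neural decoding problem reduces to fitting two linear readouts; all of the image-synthesis capacity lives in the pretrained VDVAE and Versatile Diffusion models.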