Brain Captioning: Decoding human brain activity into images and text

Every day, the human brain processes an immense volume of visual information, relying on intricate neural mechanisms to perceive and interpret these stimuli. Recent advances in functional magnetic resonance imaging (fMRI) have enabled scientists to extract visual information from patterns of human brain activity. In this study, we present a method for decoding brain activity into meaningful images and captions, with a specific focus on brain captioning because it is more flexible than decoding brain activity into images. We use the Natural Scenes Dataset, a comprehensive fMRI dataset from eight subjects who viewed images from the COCO dataset. For captioning, we employ the Generative Image-to-text Transformer (GIT) as our backbone; for image reconstruction, we propose a new pipeline based on latent diffusion models and depth estimation. The method involves training regularized linear regression models between brain activity and extracted features. Additionally, we incorporate depth maps through the ControlNet model to further guide the reconstruction process. We evaluate both the generated captions and the generated images with quantitative metrics. Our brain captioning approach outperforms existing methods, while our image reconstruction pipeline generates plausible images with improved spatial relationships. In conclusion, we demonstrate significant progress in brain decoding and show the potential of integrating vision and language to better understand human cognition. Our approach provides a flexible platform for future research, with potential applications in fields including neural art, style transfer, and portable devices.
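
The regularized linear regression step mentioned above can be illustrated with a minimal sketch. It assumes ridge (L2) regularization, a mapping from voxel responses to a feature vector, and synthetic random data in place of real fMRI recordings; the names `fit_ridge`, `predict`, `alpha`, and all array sizes are hypothetical and are not taken from the study.

```python
import numpy as np


def fit_ridge(X, Y, alpha=1.0):
    """Closed-form ridge regression from brain activity X to features Y.

    X: (n_trials, n_voxels) array of brain responses.
    Y: (n_trials, n_features) array of target features.
    Returns weights W (n_voxels, n_features) and intercept b (n_features,).
    """
    # Center both sides so the intercept is not penalized.
    x_mean = X.mean(axis=0)
    y_mean = Y.mean(axis=0)
    Xc = X - x_mean
    Yc = Y - y_mean
    n_voxels = X.shape[1]
    # Solve (Xc^T Xc + alpha * I) W = Xc^T Yc for W.
    W = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_voxels), Xc.T @ Yc)
    b = y_mean - x_mean @ W
    return W, b


def predict(X, W, b):
    """Predict feature vectors from brain activity."""
    return X @ W + b


# Synthetic stand-in data (assumption: real inputs would be fMRI voxel
# responses and features extracted from the viewed images).
rng = np.random.default_rng(0)
n_train, n_test, n_voxels, n_features = 200, 50, 30, 8
W_true = rng.normal(size=(n_voxels, n_features))
X_train = rng.normal(size=(n_train, n_voxels))
X_test = rng.normal(size=(n_test, n_voxels))
Y_train = X_train @ W_true + 0.1 * rng.normal(size=(n_train, n_features))
Y_test = X_test @ W_true + 0.1 * rng.normal(size=(n_test, n_features))

W, b = fit_ridge(X_train, Y_train, alpha=1.0)
Y_pred = predict(X_test, W, b)

# Correlation between predicted and held-out features as a simple check.
corr = np.corrcoef(Y_pred.ravel(), Y_test.ravel())[0, 1]
print(f"held-out correlation: {corr:.3f}")
```

In a setting like the one described, the predicted features would then be handed to the captioning or reconstruction model; the regularization strength `alpha` would normally be chosen by cross-validation rather than fixed as it is here.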