Recent advances in fMRI-based image reconstruction have achieved remarkable photo-realistic fidelity. Yet a persistent limitation remains: while reconstructed images often appear naturalistic and holistically similar to the target stimuli, they frequently suffer from severe semantic misalignment -- salient objects are replaced or hallucinated despite high visual quality. In this work, we address this limitation by rethinking the role of explicit semantic interpretation in fMRI decoding. We argue that existing methods rely too heavily on entangled visual embeddings that prioritize low-level appearance cues -- such as texture and global gist -- over explicit semantic identity. To overcome this, we parse fMRI signals into rich, sentence-level semantic descriptions that mirror the hierarchical and compositional nature of human visual understanding. We achieve this by leveraging grounded VLMs to generate synthetic, human-like, multi-granularity textual representations that capture object identities and spatial organization. Building on this foundation, we propose SynMind, a framework that integrates these explicit semantic encodings with visual priors to condition a pretrained diffusion model. Extensive experiments demonstrate that SynMind outperforms state-of-the-art methods across most quantitative metrics. Notably, by offloading semantic reasoning to our text-alignment module, SynMind surpasses competing methods based on SDXL while using the much smaller Stable Diffusion 1.4 and a single consumer GPU. Large-scale human evaluations further confirm that SynMind produces reconstructions more consistent with human visual perception. Neurovisualization analyses reveal that SynMind engages broader and more semantically relevant brain regions, mitigating the over-reliance on high-level visual areas.