Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as static guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67$\times$ faster inference, and more deterministic results than the diffusion-based baselines.
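To make the scale-aware guidance idea concrete, the following is a minimal sketch of one way multi-level neural embeddings could be assigned to generation scales so that coarse scales receive the most semantic level and finer scales receive lower-level ones. The even-split rule, the function names `assign_levels_to_scales` and `coarse_to_fine_schedule`, the level names, and the example scale sizes are all illustrative assumptions; the abstract does not specify how MindHier performs this matching.

```python
def assign_levels_to_scales(num_scales, num_levels):
    """Map each generation scale (0 = coarsest) to an embedding level
    (0 = most semantic). Hypothetical rule: split the scales into
    num_levels contiguous groups of roughly equal size."""
    if num_scales < 1 or num_levels < 1:
        raise ValueError("num_scales and num_levels must be positive")
    return [
        min(s * num_levels // num_scales, num_levels - 1)
        for s in range(num_scales)
    ]


def coarse_to_fine_schedule(scale_sizes, level_names):
    """Pair each token-map size with the embedding level that would
    guide generation at that scale (assumed pairing, for illustration)."""
    mapping = assign_levels_to_scales(len(scale_sizes), len(level_names))
    return [(size, level_names[lvl]) for size, lvl in zip(scale_sizes, mapping)]


if __name__ == "__main__":
    # Assumed token-map side lengths, ordered coarse to fine.
    sizes = [1, 2, 3, 4, 5, 6, 8, 10, 13, 16]
    # Assumed names for three embedding levels, most semantic first.
    levels = ["semantic", "mid", "detail"]
    for size, level in coarse_to_fine_schedule(sizes, levels):
        print(f"scale {size:>2}x{size:<2} guided by the {level} embedding")
```

Under this assumed rule the guidance level never moves back toward the semantic end as the scale index grows, which mirrors the global-semantics-first, local-details-later ordering described in the abstract. This is in contrast to a single static embedding that conditions every step identically.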