Decoding visual representations from brain signals has attracted significant attention in both neuroscience and artificial intelligence. However, the degree to which brain signals truly encode visual information remains unclear. Current visual decoding approaches explore various brain-image alignment strategies, yet most emphasize high-level semantic features while neglecting pixel-level details, thereby limiting our understanding of the human visual system. In this paper, we propose a brain-image alignment strategy that leverages multiple pre-trained visual encoders with distinct inductive biases to capture hierarchical and multi-scale visual representations, while employing a contrastive learning objective to achieve effective alignment between brain signals and visual embeddings. Furthermore, we introduce a Fusion Prior, which learns a stable mapping on large-scale visual data and subsequently matches brain features to this pre-trained prior, thereby enhancing distributional consistency across modalities. Extensive quantitative and qualitative experiments demonstrate that our method achieves a favorable balance between retrieval accuracy and reconstruction fidelity.
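The contrastive brain-image alignment described above can be illustrated with a minimal sketch. All names, dimensions, and the two stand-in encoders below are assumptions for illustration; the loss is a standard symmetric InfoNCE over paired brain and visual embeddings, where the visual embedding concatenates features from multiple pretrained encoders with different inductive biases.

```python
import numpy as np

def info_nce(brain_emb, image_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired brain and image embeddings."""
    b = brain_emb / np.linalg.norm(brain_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = b @ v.T / temperature          # (N, N) similarity matrix

    def xent(l):                            # cross-entropy with diagonal positives
        l = l - l.max(axis=1, keepdims=True)        # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return (xent(logits) + xent(logits.T)) / 2

rng = np.random.default_rng(0)
n = 8
# Stand-ins for features from two pretrained visual encoders with distinct
# inductive biases (shapes and roles are assumed, not from the paper):
semantic_feats = rng.standard_normal((n, 512))   # e.g. a semantic (CLIP-like) encoder
pixel_feats = rng.standard_normal((n, 256))      # e.g. a pixel-level encoder
visual_emb = np.concatenate([semantic_feats, pixel_feats], axis=1)

# Brain features assumed already projected into the same joint space;
# here simulated as a noisy copy of the visual embedding for the demo.
brain_emb = visual_emb + 0.1 * rng.standard_normal(visual_emb.shape)

loss = info_nce(brain_emb, visual_emb)
print(float(loss))
```

In practice the brain embedding would come from a trainable encoder over fMRI/EEG signals, and minimizing this loss pulls each brain embedding toward its paired image embedding while pushing it away from the other images in the batch.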