Neural visual decoding is a central problem in brain-computer interface research, aiming to reconstruct human visual perception and to elucidate the structure of neural representations. However, existing approaches overlook a fundamental granularity mismatch between human and machine vision: deep vision models emphasize semantic invariance by suppressing local texture information, whereas neural signals preserve an intricate mixture of low-level visual attributes and high-level semantic content. To address this mismatch, we propose Shallow Alignment, a novel contrastive learning strategy that aligns neural signals with intermediate representations of visual encoders rather than their final outputs, thereby striking a better balance between low-level texture details and high-level semantic features. Extensive experiments across multiple benchmarks demonstrate that Shallow Alignment significantly outperforms standard final-layer alignment, with performance gains ranging from 22% to 58% across diverse vision backbones. Notably, our approach effectively unlocks the scaling law in neural visual decoding, enabling decoding performance to scale predictably with the capacity of pre-trained vision backbones. We further conduct systematic empirical analyses to shed light on the mechanisms underlying the observed performance gains.
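To make the core idea concrete, the sketch below shows a symmetric InfoNCE-style contrastive loss between a batch of neural embeddings and image features taken from an intermediate (shallow) layer of a vision encoder, rather than from its final output. This is an illustrative minimal implementation in NumPy, not the paper's actual code; the function name, the temperature value, and the use of plain InfoNCE are assumptions for exposition.

```python
import numpy as np

def shallow_alignment_loss(neural_emb, image_feats, temperature=0.07):
    """Symmetric InfoNCE between neural embeddings and intermediate-layer
    image features. Both inputs are [batch, dim]; row i of each is a
    matched (positive) pair, all other rows in the batch are negatives.
    Illustrative sketch only -- not the paper's reference implementation."""
    # L2-normalize so the dot product is cosine similarity
    n = neural_emb / np.linalg.norm(neural_emb, axis=1, keepdims=True)
    v = image_feats / np.linalg.norm(image_feats, axis=1, keepdims=True)
    logits = (n @ v.T) / temperature  # [batch, batch] similarity matrix

    # Cross-entropy with the diagonal as targets, in both directions
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    loss_n2v = -np.mean(np.diag(log_probs))
    log_probs_t = logits.T - np.log(np.exp(logits.T).sum(axis=1, keepdims=True))
    loss_v2n = -np.mean(np.diag(log_probs_t))
    return (loss_n2v + loss_v2n) / 2
```

The only change relative to standard final-layer alignment is where `image_feats` comes from: instead of the encoder's final pooled output, one would tap an intermediate block (e.g. via a forward hook in PyTorch) so that the retained texture information can participate in the alignment.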