Non-invasive brain-computer interfaces suffer severe fidelity degradation in neural visual decoding when generalizing to natural visual experiences. Conventional multimodal contrastive representation learning optimizes only geometric distance alignment and neglects semantic consistency and subject selectivity, which leads to spurious zero-shot alignment. We propose SUP-MCRL, a unified framework that integrates three collaborative mechanisms: (1) a Semantic-entity Aware Visual Encoder (SAVE), which learns spatial attention to extract semantic content without pre-trained saliency models; (2) a Unified EEG Enhancer (UEE), which employs multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) a Prototype-based Progressive Augmenter (PPA), which maintains an exponential moving average (EMA)-updated pseudo-feature pool to prevent representation collapse. In zero-shot experiments on THINGS-EEG, SUP-MCRL achieves 66.0%/91.9% (Top-1/Top-5) intra-subject accuracy and 24.0%/52.9% leave-one-subject-out (LOSO) accuracy, surpassing state-of-the-art methods. Code is available at https://github.com/NZWANG/SUP-MCRL.
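To make the reported Top-1/Top-5 figures concrete, the following is a minimal sketch of how zero-shot accuracy is commonly computed in contrastive EEG-image decoding: each EEG embedding is ranked against candidate image embeddings by cosine similarity. This is an illustrative assumption about the evaluation protocol, not the authors' code; the function name `topk_accuracy`, the array shapes, and the use of cosine similarity are all hypothetical choices made here.

```python
import numpy as np


def topk_accuracy(eeg_emb, img_emb, labels, ks=(1, 5)):
    """Hypothetical helper: Top-k retrieval accuracy by cosine similarity.

    eeg_emb: (n_trials, d) EEG embeddings
    img_emb: (n_classes, d) one image embedding per candidate class
    labels:  (n_trials,) index of the true class for each trial
    """
    eeg_emb = np.asarray(eeg_emb, dtype=float)
    img_emb = np.asarray(img_emb, dtype=float)
    labels = np.asarray(labels)

    # L2-normalize so the dot product equals cosine similarity.
    eeg = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)

    sims = eeg @ img.T                  # (n_trials, n_classes)
    ranked = np.argsort(-sims, axis=1)  # most similar class first

    # A trial is a hit at k if its true class is among the k best matches.
    return {
        k: float(np.mean((ranked[:, :k] == labels[:, None]).any(axis=1)))
        for k in ks
    }


if __name__ == "__main__":
    # Synthetic data only: noisy copies of the class embeddings stand in
    # for EEG features. These numbers say nothing about SUP-MCRL itself.
    rng = np.random.default_rng(0)
    img = rng.normal(size=(200, 64))
    labels = rng.integers(0, 200, size=500)
    eeg = img[labels] + rng.normal(size=(500, 64))
    print(topk_accuracy(eeg, img, labels))
```

Under this protocol, Top-5 accuracy can never be lower than Top-1, which is consistent with the paired figures reported above.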