Existing approaches to modeling associations between visual stimuli and brain responses struggle with between-subject variance and model generalization. Inspired by recent progress in modeling speech-brain responses, we propose a ``match-vs-mismatch'' deep learning model that classifies whether a video clip induces excitatory responses in recorded EEG signals and learns associations between the visual content and the corresponding neural recordings. Using an exclusive experimental dataset, we demonstrate that the proposed model achieves higher accuracy on unseen subjects than the baseline models. Furthermore, we analyze inter-subject noise using a subject-level silhouette score in the embedding space and show that the proposed model mitigates this noise, significantly reducing the silhouette score. Moreover, we examine Grad-CAM activation scores and show that brain regions associated with language processing contribute most to the model predictions, followed by regions associated with visual processing. These results could facilitate the development of neural recording-based video reconstruction and related applications.
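To make the subject-level silhouette analysis concrete, the following is a minimal sketch of how such a score could be computed, assuming each EEG segment has already been mapped to an embedding vector and is tagged with the ID of the subject it came from. The function name `subject_silhouette`, the Euclidean distance, and the toy data are illustrative assumptions, not the authors' implementation. Subject IDs play the role of cluster labels, so a lower score indicates that embeddings group less by subject.

```python
import math
from collections import defaultdict


def subject_silhouette(embeddings, subject_ids):
    """Mean silhouette coefficient with subject IDs used as cluster labels.

    Hypothetical helper for illustration; Euclidean distance is an assumption.
    """
    if len(embeddings) != len(subject_ids):
        raise ValueError("one subject ID is required per embedding")

    # Group sample indices by subject.
    groups = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        groups[sid].append(idx)
    if len(groups) < 2:
        raise ValueError("at least two subjects are required")

    def mean_dist(i, members):
        # Mean distance from sample i to the given samples, excluding itself.
        others = [j for j in members if j != i]
        return sum(math.dist(embeddings[i], embeddings[j]) for j in others) / len(others)

    scores = []
    for i, sid in enumerate(subject_ids):
        if len(groups[sid]) == 1:
            scores.append(0.0)  # usual convention for a single-sample cluster
            continue
        a = mean_dist(i, groups[sid])  # cohesion within the same subject
        b = min(mean_dist(i, m) for s, m in groups.items() if s != sid)  # nearest other subject
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Toy 2-D embeddings (made up): in the first set samples cluster by subject,
# in the second set the two subjects are interleaved.
by_subject = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
interleaved = [(0.0, 0.0), (10.0, 1.0), (10.0, 0.0), (0.0, 1.0)]
ids = ["s1", "s1", "s2", "s2"]

print(subject_silhouette(by_subject, ids))   # close to 1: strong subject clustering
print(subject_silhouette(interleaved, ids))  # negative: little subject structure
```

In this reading, an embedding space with reduced inter-subject noise corresponds to the second case: samples from different subjects overlap, and the score moves toward zero or below.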