Reconstructing human dynamic visual perception from electroencephalography (EEG) signals is of great research significance owing to EEG's non-invasiveness and high temporal resolution. However, EEG-to-video reconstruction remains challenging for two reasons: 1) Single Modality: existing studies align EEG signals only with the text modality, which ignores other modalities and leaves models prone to overfitting; 2) Data Scarcity: current methods often struggle to converge when trained on limited EEG-video data. To address these problems, we propose MindCine, a novel framework that achieves high-fidelity video reconstruction from limited data. We employ a multimodal joint learning strategy to incorporate beyond-text modalities during training and leverage a pre-trained large EEG model to alleviate data scarcity when decoding semantic information, while a Seq2Seq model with causal attention is specifically designed to decode perceptual information. Extensive experiments demonstrate that our model outperforms state-of-the-art methods both qualitatively and quantitatively. The results further underscore the effectiveness of the complementary strengths of different modalities and show that leveraging a large-scale EEG model can further enhance reconstruction performance by mitigating the challenges associated with limited data.
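The abstract does not specify the form of the multimodal joint learning objective; the following is a minimal sketch only, assuming an InfoNCE-style contrastive loss and hypothetical embeddings `eeg_z`, `text_z`, and `image_z` (e.g., from a frozen CLIP-style encoder). It illustrates the idea of aligning EEG features with beyond-text modalities rather than the paper's verified training loss.

```python
# Hypothetical sketch of multimodal joint alignment: the EEG embedding is
# contrastively aligned with several target modalities (text and image here)
# instead of text alone. The InfoNCE form and all names are assumptions.
import torch
import torch.nn.functional as F

def info_nce(eeg_z: torch.Tensor, target_z: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between a batch of EEG embeddings and one modality."""
    eeg_z = F.normalize(eeg_z, dim=-1)
    target_z = F.normalize(target_z, dim=-1)
    logits = eeg_z @ target_z.t() / tau                      # (B, B) similarities
    labels = torch.arange(eeg_z.size(0), device=eeg_z.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

def joint_alignment_loss(eeg_z, text_z, image_z, w_text=1.0, w_image=1.0):
    """Weighted sum over modalities; the extra (beyond-text) term acts as a
    regularizer that discourages overfitting to text alone."""
    return w_text * info_nce(eeg_z, text_z) + w_image * info_nce(eeg_z, image_z)
```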
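For the perceptual branch, the abstract names a Seq2Seq model with causal attention but gives no architectural details. Below is a minimal PyTorch sketch under assumed dimensions: EEG features are encoded into a memory sequence and per-frame latents are decoded under a causal mask, so each frame attends only to earlier frames. The class name `CausalSeq2Seq` and all hyperparameters are illustrative, not the paper's implementation.

```python
# Minimal sketch of a Seq2Seq perceptual decoder with causal attention,
# assuming (B, T, d_model)-shaped EEG features and target frame latents.
import torch
import torch.nn as nn

class CausalSeq2Seq(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8, n_layers: int = 4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, n_layers)
        self.decoder = nn.TransformerDecoder(dec_layer, n_layers)

    def forward(self, eeg_feats: torch.Tensor, frame_latents: torch.Tensor) -> torch.Tensor:
        # eeg_feats: (B, T_eeg, d_model); frame_latents: (B, T_frames, d_model)
        memory = self.encoder(eeg_feats)
        # Causal mask: each frame latent attends only to earlier frames,
        # preserving the temporal order of the perceived video.
        t = frame_latents.size(1)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t).to(frame_latents.device)
        return self.decoder(frame_latents, memory, tgt_mask=causal_mask)
```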