Decoding non-invasive cognitive signals into natural language has long been a goal in building practical brain-computer interfaces (BCIs). Recent milestones have decoded cognitive signals such as functional magnetic resonance imaging (fMRI) and electroencephalography (EEG) recordings into text in an open-vocabulary setting. However, how to split datasets into training, validation, and test sets for cognitive signal decoding remains controversial. In this paper, we conduct a systematic analysis of current dataset splitting methods and find that data contamination largely exaggerates model performance. Specifically, we first find that leakage of test subjects' cognitive signals corrupts the training of a robust encoder. Second, we prove that leakage of text stimuli causes the auto-regressive decoder to memorize information in the test set: the decoder generates highly accurate text not because it truly understands the cognitive signals, but because it has memorized the stimuli. To eliminate the influence of data contamination and fairly evaluate the generalization ability of different models, we propose a new splitting method for different types of cognitive datasets (e.g., fMRI, EEG). We also evaluate state-of-the-art (SOTA) brain-to-text decoding models under the proposed splitting paradigm to provide baselines for further research.
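The two leakage channels named above (shared subjects and shared text stimuli) can be illustrated with a toy check. The sketch below is not the splitting method proposed in the paper, whose details are not given in this abstract; it only assumes that each sample carries a subject identifier and a stimulus identifier. The names `Sample`, `overlap`, and `disjoint_split` are hypothetical.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    subject: str   # hypothetical subject identifier
    stimulus: str  # hypothetical text-stimulus identifier (e.g. a sentence id)


def overlap(train, test):
    """Return (shared subjects, shared stimuli) between two sample lists."""
    shared_subjects = {s.subject for s in train} & {s.subject for s in test}
    shared_stimuli = {s.stimulus for s in train} & {s.stimulus for s in test}
    return shared_subjects, shared_stimuli


def disjoint_split(samples, test_subjects, test_stimuli):
    """Split so that train and test share neither subjects nor stimuli.

    Samples that pair a held-out subject with a training stimulus (or the
    reverse) are returned separately and belong to neither split.
    """
    train, test, dropped = [], [], []
    for s in samples:
        held_subject = s.subject in test_subjects
        held_stimulus = s.stimulus in test_stimuli
        if held_subject and held_stimulus:
            test.append(s)
        elif not held_subject and not held_stimulus:
            train.append(s)
        else:
            dropped.append(s)
    return train, test, dropped


# Toy data (assumption): 3 subjects, each exposed to the same 4 sentences.
samples = [Sample(f"sub{i}", f"sent{j}") for i in range(3) for j in range(4)]

# Naive positional split: both subjects and stimuli leak into the test set.
naive_train, naive_test = samples[:9], samples[9:]
leaked_subjects, leaked_stimuli = overlap(naive_train, naive_test)

# Strict split: hold out one subject and one sentence entirely.
train, test, dropped = disjoint_split(samples, {"sub2"}, {"sent3"})
clean_subjects, clean_stimuli = overlap(train, test)
```

In this toy setting the strict split removes both kinds of overlap, at the cost of discarding the samples that mix held-out and training identifiers. How to balance that cost for real fMRI or EEG datasets is a design decision this sketch does not address.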