Recent success in natural language processing has motivated growing interest in large-scale foundation models for neuroimaging data. Such models often require discretization of continuous neural time series, a process referred to as 'tokenization'. However, how different tokenization strategies affect models of neural data remains poorly understood. In this work, we present a systematic evaluation of sample-level tokenization strategies for transformer-based large neuroimaging models (LNMs) applied to magnetoencephalography (MEG) data. We compare learnable and non-learnable tokenizers by examining their signal reconstruction fidelity and their impact on subsequent foundation modeling performance (token prediction, biological plausibility of generated data, preservation of subject-specific information, and downstream task performance). For the learnable tokenizer, we introduce a novel autoencoder-based approach. Experiments were conducted on three publicly available MEG datasets spanning different acquisition sites, scanners, and experimental paradigms. Our results show that both learnable and non-learnable discretization schemes achieve high reconstruction accuracy and broadly comparable performance across most evaluation criteria, suggesting that simple, fixed sample-level tokenization strategies can be used in the development of neural foundation models. The code is available at https://github.com/OHBA-analysis/Cho2026_Tokenizer.
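To make the notion of sample-level tokenization concrete, below is a minimal sketch of a non-learnable tokenizer that maps each continuous sample to a discrete token id by uniform binning of the z-scored signal, together with the approximate inverse used to assess reconstruction fidelity. The abstract does not specify the non-learnable scheme, so the uniform-binning choice, the vocabulary size, the clipping range, and all function names here are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def fit_uniform_bins(x, vocab_size=256, clip_std=4.0):
    """Fit a non-learnable sample-level tokenizer: estimate z-scoring
    statistics and define uniform bin edges over [-clip_std, clip_std]."""
    mu, sigma = x.mean(), x.std()
    edges = np.linspace(-clip_std, clip_std, vocab_size - 1)
    return mu, sigma, edges

def tokenize(x, mu, sigma, edges):
    """Map each continuous sample to a token id in [0, vocab_size)."""
    z = (x - mu) / sigma
    return np.digitize(z, edges)  # out-of-range samples fall in the outer bins

def detokenize(tokens, mu, sigma, edges):
    """Approximately invert tokenization: map each token to its bin center."""
    # Outer tokens (beyond the clip range) are mapped to the outermost edges.
    centers = np.concatenate([[edges[0]], (edges[:-1] + edges[1:]) / 2, [edges[-1]]])
    return centers[tokens] * sigma + mu

# Toy single-channel MEG-like signal: a z-scored random walk, 1000 samples.
rng = np.random.default_rng(0)
x = rng.standard_normal(1000).cumsum()
x = (x - x.mean()) / x.std()

mu, sigma, edges = fit_uniform_bins(x, vocab_size=256)
tokens = tokenize(x, mu, sigma, edges)
x_hat = detokenize(tokens, mu, sigma, edges)
print("reconstruction MSE:", np.mean((x - x_hat) ** 2))
```

A learnable tokenizer, such as the autoencoder-based approach introduced in the paper, would replace the fixed binning with encoder and decoder networks trained to minimize reconstruction error; the sketch above illustrates only the fixed, non-learnable baseline side of the comparison.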