EEG-FM-Bench: A Comprehensive Benchmark for the Systematic Evaluation and Diagnostic Analyses of EEG Foundation Models

Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that make cross-model comparisons unreliable, and the absence of diagnostic analyses obscures the internal mechanisms that drive transfer efficiency and scaling behavior. To address this, we introduce \textbf{EEG-FM-Bench}, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and supports diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, together with tools for gradient and representation analysis. Our experiments and analyses yield three main findings: (1) multi-task learning often acts as a useful regularizer that mitigates overfitting in data-scarce EEG settings, although negative transfer can arise under specific task paradigms; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) when released checkpoints are evaluated under a matched downstream protocol, model or data scale alone does not fully explain transfer performance, while objective alignment, adaptation compatibility, and EEG-specific design appear to be important factors. EEG-FM-Bench is a step toward fairer comparison and more reproducible, interpretable analysis of EEG-FMs. Code is available at https://github.com/xw1216/EEG-FM-Bench.
