Electroencephalography foundation models (EEG-FMs) have advanced brain signal analysis, but the lack of standardized evaluation benchmarks impedes model comparison and scientific progress. Current evaluations rely on inconsistent protocols that render cross-model comparisons unreliable, while the absence of diagnostic analyses obscures the internal mechanisms driving transfer efficiency and scaling behavior. To address this, we introduce \textbf{EEG-FM-Bench}, a unified system for the standardized evaluation of EEG-FMs. The benchmark integrates 14 datasets across 10 paradigms and incorporates diverse experimental settings, including multiple fine-tuning strategies, task organizations, and classifier configurations, supported by tools for gradient and representation analysis. Our experiments and analysis reveal several critical insights: (1) multi-task learning acts as a critical regularizer that mitigates overfitting in data-scarce EEG contexts; (2) pre-training efficiency is currently limited by gradient conflicts between reconstruction objectives and downstream tasks; (3) model scaling deviates from typical scaling laws, as compact architectures with domain-specific inductive biases consistently outperform significantly larger models. This benchmark enables fair comparison and reproducible analysis, shifting the field from fragmented results to interpretable advances. Code is available at https://github.com/xw1216/EEG-FM-Bench.