Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge

Reliable automatic seizure detection from long-term electroencephalography (EEG) remains an unsolved challenge, as current models often fail to generalize across patients or clinical settings. Manual EEG review is still the standard of care, highlighting the need for robust models and standardized evaluation. The literature often reports high efficacy, yet these models frequently fail when deployed on unseen patient populations. To rigorously assess this generalization gap, we conducted a large-scale empirical study evaluating 28 state-of-the-art algorithms, ranging from classical feature engineering to modern deep learning. These algorithms were collected through a competition, the SzCORE Challenge. Performance was evaluated on a strictly held-out private dataset of continuous EEG recordings from 65 subjects, totaling 4,360 hours. Expert neurophysiologists annotated these recordings, establishing the ground truth for seizure events. Algorithms were evaluated using event-based metrics from the SzCORE framework, including sensitivity, precision, F1-score, and false positive rate per day. Results revealed substantial performance variability among state-of-the-art approaches, with a top F1-score of 32% (sensitivity 37%, precision 29%), highlighting the persistent difficulty of this task. The analysis also uncovered a discordance between peak performance and population-level stability: the algorithms achieving the highest aggregate F1-scores did not achieve the most consistent rankings across subjects. This independent evaluation exposed a notable gap between self-reported efficacy and held-out performance, underscoring the critical need for standardized, rigorous benchmarking. The evaluation infrastructure is transitioning into a continuously open benchmarking platform, fostering reproducible research and accelerating the development of robust seizure detection algorithms.
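To make the event-based metrics named above concrete, the following is a minimal Python sketch of how sensitivity, precision, F1-score, and false positives per day could be computed from annotated and detected seizure events. It is an illustration under stated assumptions, not the SzCORE implementation: the names `score_events`, `overlaps`, and `EventScores` are hypothetical, and the any-overlap matching rule is a simplification that may differ from the framework's actual matching and tolerance rules.

```python
from dataclasses import dataclass


@dataclass
class EventScores:
    sensitivity: float
    precision: float
    f1: float
    fp_per_day: float


def overlaps(a, b):
    """Return True if two (start, end) intervals, in seconds, share any time."""
    return a[0] < b[1] and b[0] < a[1]


def score_events(reference, hypothesis, duration_hours):
    """Score detected events against annotated events.

    Assumption (not taken from the source): a reference event counts as
    detected if any hypothesis event overlaps it, and a hypothesis event
    is a false positive if it overlaps no reference event.
    """
    tp = sum(any(overlaps(r, h) for h in hypothesis) for r in reference)
    fn = len(reference) - tp
    fp = sum(not any(overlaps(h, r) for r in reference) for h in hypothesis)

    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = sensitivity + precision
    # F1 is the harmonic mean of sensitivity and precision.
    f1 = 2 * sensitivity * precision / denom if denom else 0.0
    # Normalize the false-positive count by recording length in days.
    fp_per_day = fp / (duration_hours / 24.0)
    return EventScores(sensitivity, precision, f1, fp_per_day)


if __name__ == "__main__":
    # Made-up toy events (seconds) over a 48-hour recording.
    reference = [(100, 160), (1000, 1050), (5000, 5100)]
    hypothesis = [(110, 150), (2000, 2020), (5050, 5070), (7000, 7010)]
    print(score_events(reference, hypothesis, duration_hours=48))
```

In this toy example two of the three annotated events are detected and two detections match nothing, so the sketch yields a sensitivity of 2/3, a precision of 1/2, and one false positive per day. Scoring at the event level rather than per sample means a single long seizure and a single short one carry equal weight.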
