EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Using concept steering, we introduce a "target vs. off-target" probe area metric that quantifies steering selectivity and reveals three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, in which one concept cannot be suppressed without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
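As a rough illustration of the two latent-space operations named in the abstract, the following minimal NumPy sketch shows a generic TopK SAE forward pass and a simple feature-scaling form of concept steering. It is not the authors' implementation: the function names (`topk_sae_forward`, `steer`), the weight layout, the pre-encoder bias subtraction, and the ReLU on the kept activations are all assumptions made for illustration.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Generic TopK SAE forward pass (illustrative, hypothetical layout).

    x: (n, d) embeddings; W_enc: (d, m); W_dec: (m, d); b_enc: (m,); b_dec: (d,).
    Returns sparse codes z of shape (n, m) and reconstructions of shape (n, d).
    """
    # Encoder pre-activations; subtracting b_dec first is an assumed convention.
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    # Keep only the top-k entries (clipped at zero); everything else stays 0.
    z = np.zeros_like(pre)
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    # Linear decoder back to embedding space.
    x_hat = z @ W_dec + b_dec
    return z, x_hat

def steer(z, feature, scale):
    """Return a copy of z with one dictionary feature rescaled.

    scale=0.0 suppresses the feature; scale>1.0 amplifies it.
    """
    z_steered = z.copy()
    z_steered[:, feature] *= scale
    return z_steered

# Toy usage with random weights (no trained model is implied).
rng = np.random.default_rng(0)
n, d, m, k = 4, 8, 16, 3
x = rng.normal(size=(n, d))
W_enc = rng.normal(size=(d, m))
W_dec = rng.normal(size=(m, d))
b_enc = np.zeros(m)
b_dec = np.zeros(d)

z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
z_suppressed = steer(z, feature=2, scale=0.0)
x_steered = z_suppressed @ W_dec + b_dec  # decode the steered code
```

In a setting like the one described, the steered embedding would then be passed to concept probes to compare the change in the target concept against changes in off-target concepts; that evaluation step is not sketched here.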