Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Sándor Beniczky, Sadasivan Puthusserypady, James Zou, Lars Kai Hansen

From arXiv. Preprint. 14 pages, 7 figures, 4 tables

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, in which one concept cannot be suppressed without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
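As a rough illustration of the two operations the abstract names, the sketch below shows a TopK sparse autoencoder forward pass and a simple latent steering step in NumPy. It is not the authors' implementation: the function names, the weight shapes, the pre-encoder bias subtraction, the ReLU on the kept activations, and the multiplicative form of steering are all assumptions chosen for a minimal example.

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Hypothetical TopK SAE forward pass (illustrative, not the paper's code).

    x:      (batch, d_model) embeddings from a frozen model
    W_enc:  (d_model, n_features), b_enc: (n_features,)
    W_dec:  (n_features, d_model), b_dec: (d_model,)
    Returns (codes, recon); each row of codes has at most k nonzero entries.
    """
    # Assumption: centre the input with the decoder bias before encoding.
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # Keep only the top-k entries and clip negatives to zero.
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    recon = codes @ W_dec + b_dec
    return codes, recon


def steer_feature(codes, W_dec, b_dec, feature, scale):
    """Scale one dictionary feature and decode back to embedding space.

    Assumption: steering is multiplicative; scale=0 suppresses the feature,
    scale=1 leaves the reconstruction unchanged.
    """
    steered = codes.copy()
    steered[:, feature] *= scale
    return steered @ W_dec + b_dec


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, n_features, k = 8, 32, 5
    x = rng.normal(size=(4, d_model))
    W_enc = rng.normal(size=(d_model, n_features))
    W_dec = rng.normal(size=(n_features, d_model))
    b_enc, b_dec = np.zeros(n_features), np.zeros(d_model)
    codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
    print(codes.shape, recon.shape)
```

In a setting like the one described, the steered embedding would then be passed to downstream probes to compare the change on the targeted concept against the change on the other concepts; how that comparison is scored is defined in the paper, not here.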
