Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen

arXiv preprint. 14 pages, 7 figures, 4 tables

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Using concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity, and it reveals three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, in which one concept cannot be suppressed without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
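The abstract gives no implementation details, so the following is only a generic sketch of what a TopK sparse autoencoder and a simple feature-steering step can look like, written in NumPy. Every name here (`topk_encode`, `decode`, `steer`, the weight matrices, the `scale` parameter) is a hypothetical illustration and not the authors' code; the choice to apply ReLU to the kept activations and to add only the decoded change back to the embedding are assumptions.

```python
import numpy as np

def topk_encode(x, W_enc, b_enc, k):
    """Encode embeddings, keeping only the k largest pre-activations per row."""
    pre = x @ W_enc + b_enc                          # shape (batch, n_features)
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]   # indices of the k largest entries
    rows = np.arange(pre.shape[0])[:, None]
    codes = np.zeros_like(pre)
    # Assumption: kept activations are passed through a ReLU.
    codes[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    return codes

def decode(codes, W_dec, b_dec):
    """Linear decoder mapping sparse codes back to embedding space."""
    return codes @ W_dec + b_dec

def steer(x, W_enc, b_enc, W_dec, k, feature, scale):
    """Rescale one dictionary feature and return the edited embedding.

    Assumption: only the decoded *change* is added to x, so the SAE's
    reconstruction error is not injected into the embedding.
    """
    codes = topk_encode(x, W_enc, b_enc, k)
    steered = codes.copy()
    steered[:, feature] *= scale                     # scale=0 suppresses the feature
    return x + (steered - codes) @ W_dec

# Toy usage with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
d, n_features, k = 16, 64, 4
W_enc = rng.normal(size=(d, n_features))
b_enc = np.zeros(n_features)
W_dec = rng.normal(size=(n_features, d))
b_dec = np.zeros(d)
x = rng.normal(size=(8, d))

codes = topk_encode(x, W_enc, b_enc, k)
recon = decode(codes, W_dec, b_dec)
x_suppressed = steer(x, W_enc, b_enc, W_dec, k, feature=0, scale=0.0)
```

In this sketch, a `scale` of 1.0 leaves the embedding unchanged, while 0.0 removes the chosen feature's contribution wherever it is active. A downstream probe on the edited embedding is what a selectivity measurement would then be computed from.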

