EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque, which is a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) to three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter-selection procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Using concept steering, we introduce a "target vs. off-target" probe area metric that quantifies steering selectivity and reveals three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, in which one concept cannot be suppressed without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
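For readers unfamiliar with TopK SAEs, the following is a minimal, dependency-free sketch of one forward pass over a single embedding vector. It illustrates the general technique only and is not the configuration used in this work: the function name `topk_sae_forward`, the subtraction of the decoder bias before encoding, the ReLU applied to the kept activations, the tied random weights, and the sizes (`d = 16`, `m = 64`, `k = 8`) are all assumptions chosen for illustration.

```python
import random

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One TopK SAE forward pass for a single embedding x of length d.

    W_enc is d x m, W_dec is m x d (nested lists). Returns (codes, recon).
    """
    d, m = len(x), len(b_enc)
    # Assumption: center the input with the decoder bias before encoding.
    centered = [x[i] - b_dec[i] for i in range(d)]
    # Linear encoder pre-activations, one per dictionary feature.
    pre = [sum(centered[i] * W_enc[i][j] for i in range(d)) + b_enc[j]
           for j in range(m)]
    # TopK: keep only the k largest pre-activations, zero the rest.
    keep = sorted(range(m), key=lambda j: pre[j], reverse=True)[:k]
    codes = [0.0] * m
    for j in keep:
        codes[j] = max(pre[j], 0.0)  # assumption: ReLU on kept features
    # Linear decoder: reconstruct the embedding from the sparse code.
    recon = [sum(codes[j] * W_dec[j][i] for j in keep) + b_dec[i]
             for i in range(d)]
    return codes, recon

# Toy usage with random weights (hypothetical sizes, untrained dictionary).
random.seed(0)
d, m, k = 16, 64, 8
x = [random.gauss(0, 1) for _ in range(d)]
W_enc = [[random.gauss(0, 1) / d ** 0.5 for _ in range(m)] for _ in range(d)]
W_dec = [[W_enc[i][j] for i in range(d)] for j in range(m)]  # tied transpose
b_enc, b_dec = [0.0] * m, [0.0] * d
codes, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

In such a setup, `codes` has at most `k` nonzero entries per input. A steering intervention in the sense described above would rescale or clamp selected entries of `codes` before decoding.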