Interpretability remains a key challenge for deploying large language models (LLMs) in clinical tasks such as Alzheimer's disease progression diagnosis, where early and trustworthy predictions are essential. Existing attribution methods exhibit high inter-method variability and unstable explanations owing to the polysemantic nature of LLM representations, while mechanistic interpretability approaches lack direct alignment with model inputs and outputs and do not provide explicit importance scores. We introduce a unified interpretability framework that integrates the attributional and mechanistic perspectives through monosemantic feature extraction. By constructing a monosemantic embedding space at a chosen LLM layer and explicitly optimizing the framework to reduce inter-method variability, our approach produces stable input-level importance scores and highlights salient features via a decompressed representation of the layer of interest, advancing the safe and trustworthy application of LLMs in cognitive health and neurodegenerative disease.
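To make the idea concrete, the following is a minimal, illustrative sketch only: the abstract does not disclose the architecture or objective, so this assumes a sparse-autoencoder-style "decompression" of one LLM layer's activations into a wider monosemantic feature space, combined with a penalty on the disagreement between two attribution methods. All names here (SparseAutoencoder, diagnosis_head, feature_attributions) are hypothetical stand-ins, not the paper's actual components.

```python
# Hypothetical sketch: monosemantic feature extraction from one LLM layer
# plus an explicit inter-method-variability penalty on feature attributions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    """Overcomplete encoder/decoder: widens a layer's activations into a
    sparser, more monosemantic feature space and reconstructs them back."""
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, h):
        f = F.relu(self.encoder(h))   # monosemantic feature activations
        return f, self.decoder(f)     # features and reconstructed layer state

def feature_attributions(f, head):
    """Two attribution variants on the feature space: gradient x input, and a
    crude one-step integrated-gradients approximation from a zero baseline."""
    grad_at_f = torch.autograd.grad(head(f).sum(), f, create_graph=True)[0]
    f_mid = 0.5 * f                   # midpoint between the zero baseline and f
    grad_at_mid = torch.autograd.grad(head(f_mid).sum(), f_mid, create_graph=True)[0]
    return torch.stack([grad_at_f * f, grad_at_mid * f])   # (2, batch, d_features)

torch.manual_seed(0)
h = torch.randn(32, 256)                       # stand-in for captured LLM layer activations
sae = SparseAutoencoder(d_model=256, d_features=1024)
diagnosis_head = nn.Sequential(                # hypothetical progression-score readout
    nn.Linear(1024, 64), nn.Tanh(), nn.Linear(64, 1)
)
opt = torch.optim.Adam([*sae.parameters(), *diagnosis_head.parameters()], lr=1e-3)

for _ in range(200):
    f, h_hat = sae(h)
    attr = feature_attributions(f, diagnosis_head)
    loss = (
        F.mse_loss(h_hat, h)                   # faithfully reconstruct the layer
        + 1e-3 * f.abs().mean()                # sparsity encourages monosemantic features
        + attr.var(dim=0).mean()               # explicitly shrink inter-method variability
    )
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Under these assumptions, stable input-level importance scores would then be obtained by averaging the feature-space attributions across methods and mapping them back to the input tokens; the actual framework may differ in both the decomposition and the attribution methods it aggregates.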