Mechanistic Interpretability (MI) has emerged as a vital approach to demystifying the opaque decision-making of Large Language Models (LLMs). However, existing reviews primarily treat MI as an observational science, summarizing analytical insights while lacking a systematic framework for actionable intervention. To bridge this gap, we present a practical survey structured around the pipeline "Locate, Steer, and Improve." We formally categorize Localizing (diagnosis) and Steering (intervention) methods according to the specific Interpretable Objects they target, establishing a rigorous intervention protocol. Furthermore, we demonstrate how this framework enables tangible improvements in Alignment, Capability, and Efficiency, effectively operationalizing MI as an actionable methodology for model optimization. The curated paper list of this work is available at https://github.com/rattlesnakey/Awesome-Actionable-MI-Survey.