Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines for analyzing the latent representations within LLMs. While such pipelines may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable, model-agnostic framework for evaluating feature-description alignment. FADE evaluates alignment across four key metrics (Clarity, Responsiveness, Purity, and Faithfulness) and systematically quantifies the causes of misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and to assess key components of automated interpretability pipelines, aiming to improve the quality of generated descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs as compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE.