Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines for analyzing the latent representations within LLMs. While such pipelines may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable, model-agnostic framework for evaluating feature-description alignment. FADE evaluates alignment across four key metrics (Clarity, Responsiveness, Purity, and Faithfulness) and systematically quantifies the causes of misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and to assess key components of automated interpretability pipelines, aiming to improve the quality of generated descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs as compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE.