Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders constrained by training-time regularization applied to individual training instances, without an explicit guarantee of global sparsity across instances, yielding a large number of dense (simultaneously active) features and harming feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation; we therefore term our model the Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we leverage to characterize the inference dynamics of LLMs. (2) Feature untangling. Thanks to its improved training strategy, BAE avoids dense features while producing the largest number of interpretable features among the baselines.
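The core mechanism described above (a 1-bit step function made trainable via a gradient estimator, plus an entropy penalty computed over the minibatch rather than per instance) can be sketched compactly. The sketch below is illustrative, not the paper's implementation: the names `BinaryAutoencoder` and `batch_entropy`, the identity straight-through estimator, the factorized per-feature Bernoulli entropy, and all hyperparameters (dimensions, entropy weight 0.1, learning rate) are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):
    """Sketch of a binary autoencoder: hidden activations are
    discretized to {0, 1} with a step function, and a straight-through
    estimator passes gradients through the non-differentiable step."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        pre = self.encoder(x)                 # (batch, n_features)
        hard = (pre > 0).float()              # 1-bit step function
        # Straight-through estimator (an assumed choice): the forward
        # pass uses `hard`; the backward pass treats the step as the
        # identity on `pre`.
        codes = hard + pre - pre.detach()
        return self.decoder(codes), codes

def batch_entropy(codes: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Factorized entropy estimate on binary codes: mean per-feature
    Bernoulli entropy of the firing rates over the minibatch. This is
    a simplifying assumption; the paper's estimator may differ."""
    p = codes.mean(dim=0).clamp(eps, 1 - eps)  # firing rate per feature
    h = -(p * p.log() + (1 - p) * (1 - p).log())
    return h.mean()

# Hypothetical training step: reconstruction loss plus the minibatch
# entropy penalty; 0.1 is an assumed trade-off weight.
model = BinaryAutoencoder(d_model=768, n_features=4096)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(64, 768)                       # stand-in hidden states
opt.zero_grad()
recon, codes = model(x)
loss = nn.functional.mse_loss(recon, x) + 0.1 * batch_entropy(codes)
loss.backward()
opt.step()
```

Note the design contrast this sketch is meant to surface: a per-instance penalty such as L1 constrains each activation vector in isolation, whereas minimizing entropy of the batch-level firing rates expresses the sparsity constraint across instances, which is the global-sparsity property the abstract argues prior regularizers lack.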