Existing works are dedicated to untangling atomized numerical components (features) from the hidden states of Large Language Models (LLMs). However, they typically rely on autoencoders constrained by training-time regularization applied to individual training instances, without an explicit guarantee of global sparsity across instances, yielding a large number of dense (simultaneously active) features and harming feature sparsity and atomization. In this paper, we propose a novel autoencoder variant that enforces minimal entropy on minibatches of hidden activations, thereby promoting feature independence and sparsity across instances. For efficient entropy calculation, we discretize the hidden activations to 1-bit via a step function and apply gradient estimation to enable backpropagation; we therefore term our model the Binary Autoencoder (BAE) and empirically demonstrate two major applications: (1) Feature set entropy calculation. Entropy can be reliably estimated on binary hidden activations, which we leverage to characterize the inference dynamics of LLMs. (2) Feature untangling. Thanks to its improved training strategy, BAE avoids dense features while producing the largest number of interpretable features among the baselines.
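The core mechanism described above (a 1-bit step function made trainable via a gradient estimator, plus an entropy penalty computed over the minibatch rather than per instance) can be sketched compactly. The sketch below is illustrative, not the paper's implementation: the names `BinaryAutoencoder` and `batch_entropy`, the identity straight-through estimator, the factorized per-feature Bernoulli entropy, and all hyperparameters (dimensions, entropy weight 0.1, learning rate) are assumptions for the sake of a runnable example.

```python
import torch
import torch.nn as nn

class BinaryAutoencoder(nn.Module):
    """Sketch of a binary autoencoder: hidden activations are
    discretized to {0, 1} with a step function, and a straight-through
    estimator passes gradients through the non-differentiable step."""

    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, x: torch.Tensor):
        pre = self.encoder(x)                 # (batch, n_features)
        hard = (pre > 0).float()              # 1-bit step function
        # Straight-through estimator (an assumed choice): the forward
        # pass uses `hard`; the backward pass treats the step as the
        # identity on `pre`.
        codes = hard + pre - pre.detach()
        return self.decoder(codes), codes

def batch_entropy(codes: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Factorized entropy estimate on binary codes: mean per-feature
    Bernoulli entropy of the firing rates over the minibatch. This is
    a simplifying assumption; the paper's estimator may differ."""
    p = codes.mean(dim=0).clamp(eps, 1 - eps)  # firing rate per feature
    h = -(p * p.log() + (1 - p) * (1 - p).log())
    return h.mean()

# Hypothetical training step: reconstruction loss plus the minibatch
# entropy penalty; 0.1 is an assumed trade-off weight.
model = BinaryAutoencoder(d_model=768, n_features=4096)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
x = torch.randn(64, 768)                       # stand-in hidden states
opt.zero_grad()
recon, codes = model(x)
loss = nn.functional.mse_loss(recon, x) + 0.1 * batch_entropy(codes)
loss.backward()
opt.step()
```

Note the design contrast this sketch is meant to surface: a per-instance penalty such as L1 constrains each activation vector in isolation, whereas minimizing entropy of the batch-level firing rates expresses the sparsity constraint across instances, which is the global-sparsity property the abstract argues prior regularizers lack.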