Sensitive-direction experiments probe the computational features of language models (LMs) by measuring how much next-token prediction probabilities change when activations are perturbed along specific directions. We extend this line of work by introducing an improved baseline for perturbation directions. Against this baseline, the KL divergence induced by Sparse Autoencoder (SAE) reconstruction errors is no longer pathologically high. We also show that the feature directions uncovered by SAEs influence model outputs to varying degrees depending on the SAE's sparsity, with lower-L0 SAE feature directions exerting a greater influence. Finally, we find that end-to-end SAE features do not exhibit stronger effects on model outputs than those of traditional SAEs.
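The core measurement can be illustrated with a minimal sketch: perturb an activation vector along a (normalized) direction, map both the original and perturbed activations to next-token distributions, and compute the KL divergence between them. All names here (`perturbation_kl`, the random linear unembedding) are hypothetical toy stand-ins, not the actual experimental setup.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q, eps=1e-12):
    # KL(p || q) for discrete distributions, with eps for stability
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def perturbation_kl(activation, direction, unembed, scale=1.0):
    """KL between next-token distributions before and after perturbing
    `activation` by `scale` along the unit vector of `direction`."""
    d = direction / np.linalg.norm(direction)
    p = softmax(activation @ unembed)
    q = softmax((activation + scale * d) @ unembed)
    return kl_divergence(p, q)

# Toy setup: random unembedding matrix and activation (hypothetical sizes)
rng = np.random.default_rng(0)
d_model, vocab = 64, 100
unembed = rng.normal(size=(d_model, vocab))
act = rng.normal(size=d_model)

# A random direction serves as the simplest perturbation baseline;
# an SAE reconstruction-error direction would be compared against it.
rand_dir = rng.normal(size=d_model)
print(perturbation_kl(act, rand_dir, unembed, scale=1.0))
```

Comparing the KL for SAE reconstruction-error directions against such a baseline (at matched perturbation norm) is what distinguishes a genuinely "pathological" direction from one whose effect is explained by norm alone.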