Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent work has improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we address these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that, compared to end-to-end (e2e) SAEs, our approach reaches the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
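To make the headline metric concrete, the following is a minimal sketch of how a relative reduction in the cross entropy loss gap could be computed. It assumes the gap is measured as the loss with SAE reconstructions inserted minus the loss of the unmodified model; the function names and the numeric loss values are hypothetical and are not taken from the paper.

```python
def ce_gap(ce_with_sae: float, ce_original: float) -> float:
    """Extra cross entropy loss incurred by inserting SAE reconstructions."""
    return ce_with_sae - ce_original


def gap_reduction(ce_original: float, ce_sae_before: float, ce_sae_after: float) -> float:
    """Fraction of the loss gap closed by finetuning (0.0 = none, 1.0 = all).

    ce_sae_before: loss with the SAE inserted, before LoRA finetuning.
    ce_sae_after:  loss with the SAE inserted, after LoRA finetuning.
    """
    gap_before = ce_gap(ce_sae_before, ce_original)
    gap_after = ce_gap(ce_sae_after, ce_original)
    if gap_before <= 0:
        raise ValueError("expected SAE insertion to increase the loss")
    return (gap_before - gap_after) / gap_before


# Hypothetical loss values, chosen only to illustrate the arithmetic.
reduction = gap_reduction(ce_original=2.50, ce_sae_before=2.70, ce_sae_after=2.60)
print(f"gap reduction: {reduction:.0%}")
```

Under this reading, a reported reduction of 30% to 55% would mean that between roughly a third and a half of the extra loss introduced by the SAE is recovered after finetuning.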