Low-Rank Adapting Models for Sparse Autoencoders

Sparse autoencoders (SAEs) decompose language model representations into a sparse set of linear latent vectors. Recent work has improved SAEs using language model gradients, but these techniques require many expensive backward passes during training and still cause a significant increase in cross entropy loss when SAE reconstructions are inserted into the model. In this work, we address these limitations by taking a fundamentally different approach: we use low-rank adaptation (LoRA) to finetune the language model itself around a previously trained SAE. We analyze our method across SAE sparsity, SAE width, language model size, LoRA rank, and model layer on the Gemma Scope family of SAEs. In these settings, our method reduces the cross entropy loss gap by 30% to 55% when SAEs are inserted during the forward pass. We also find that, compared to end-to-end (e2e) SAEs, our approach achieves the same downstream cross entropy loss 3$\times$ to 20$\times$ faster on Gemma-2-2B and 2$\times$ to 10$\times$ faster on Llama-3.2-1B. We further show that our technique improves downstream metrics and can adapt multiple SAEs at once. Our results demonstrate that improving model interpretability is not limited to post-hoc SAE training; Pareto improvements can also be achieved by directly optimizing the model itself.
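
As a rough illustration of the idea, the NumPy sketch below shows a frozen SAE inserted into the forward pass, followed by a frozen linear layer with a low-rank adapter. It is not the authors' implementation. The toy dimensions, the random weights standing in for a pretrained SAE, the single downstream layer, and the names `sae_reconstruct`, `downstream`, and `loss_gap_reduction` are all assumptions made for illustration; the abstract does not specify which layers are adapted or how training is run. The last function assumes that "loss gap" means the cross entropy with the SAE inserted minus the cross entropy of the original model.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, rank = 8, 32, 2  # toy sizes, chosen only for illustration

# Frozen SAE: random weights stand in for a previously trained SAE (assumption).
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)


def sae_reconstruct(x):
    """Encode activations with a ReLU encoder, then decode them linearly."""
    latents = np.maximum(x @ W_enc + b_enc, 0.0)
    return latents @ W_dec + b_dec


# Frozen weight of a hypothetical layer downstream of the SAE.
W = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)

# LoRA factors: only A and B would be trained. B starts at zero, so the
# adapted layer initially matches the frozen layer exactly.
A = rng.normal(size=(d_model, rank)) * 0.01
B = np.zeros((rank, d_model))


def downstream(h, A, B):
    """Frozen layer plus the low-rank update: h W + (h A) B."""
    return h @ W + (h @ A) @ B


def loss_gap_reduction(ce_base, ce_sae, ce_sae_lora):
    """Fraction of the cross entropy gap closed by the adapter.

    ce_base: loss of the original model
    ce_sae: loss with the SAE inserted, before adaptation
    ce_sae_lora: loss with the SAE inserted, after adaptation
    """
    gap_before = ce_sae - ce_base
    gap_after = ce_sae_lora - ce_base
    return (gap_before - gap_after) / gap_before


x = rng.normal(size=(4, d_model))  # a toy batch of activations
h = sae_reconstruct(x)             # SAE reconstruction replaces the activations
y = downstream(h, A, B)            # adapted layer consumes the reconstruction
trainable = A.size + B.size        # 32 adapter parameters vs 64 in W
```

With made-up losses of 2.0 (original model), 2.5 (SAE inserted), and 2.3 (SAE inserted after adaptation), `loss_gap_reduction` returns about 0.4, meaning 40% of the gap is closed. These numbers are illustrative only and are not results from the paper.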