Overcoming Sparsity Artifacts in Crosscoders to Interpret Chat-Tuning

Model diffing is the study of how fine-tuning changes a model's representations and internal algorithms. Many behaviors of interest are introduced during fine-tuning, and model diffing offers a promising lens for interpreting such behaviors. Crosscoders are a recent model diffing method that learns a shared dictionary of interpretable concepts represented as latent directions in both the base and fine-tuned models, allowing us to track how concepts shift or emerge during fine-tuning. Notably, prior work has observed concepts with no direction in the base model and hypothesized that these model-specific latents are concepts introduced during fine-tuning. However, we identify two issues that stem from the crosscoder's L1 training loss and can cause concepts to be misattributed as unique to the fine-tuned model when they in fact exist in both models. We develop Latent Scaling to flag these issues by more accurately measuring each latent's presence across models. In experiments comparing Gemma 2 2B base and chat models, we observe that the standard crosscoder suffers heavily from these issues. Building on these insights, we train a crosscoder with BatchTopK loss and show that it substantially mitigates these issues, finding more genuinely chat-specific and highly interpretable concepts. We recommend that practitioners adopt similar techniques. Using the BatchTopK crosscoder, we identify a set of chat-specific latents that are both interpretable and causally effective, representing concepts such as $\textit{false information}$ and $\textit{personal question}$, along with multiple refusal-related latents that show nuanced preferences for different refusal triggers. Overall, our work advances best practices for crosscoder-based model diffing and demonstrates that the methodology can provide concrete insights into how chat-tuning modifies model behavior.
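The abstract does not spell out either technique, so the following is only a minimal sketch under stated assumptions. It assumes that BatchTopK keeps the `k * batch_size` largest latent activations across a whole batch, and that a latent's presence in a model can be estimated by a least-squares scale `beta` fitted between the latent's contribution (activation times decoder direction) and that model's activations. All function names, shapes, and the exact fitting target are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np


def batch_topk(pre_acts: np.ndarray, k: int) -> np.ndarray:
    """Assumed BatchTopK activation: ReLU, then keep the k * batch_size
    largest activations across the whole batch and zero out the rest.

    pre_acts has shape (batch_size, n_latents). Ties at the threshold are
    all kept, so slightly more than k * batch_size entries may survive.
    """
    acts = np.maximum(pre_acts, 0.0)
    flat = acts.ravel()
    n_keep = k * acts.shape[0]
    if n_keep >= flat.size:
        return acts
    # Value of the n_keep-th largest activation in the batch.
    cut = flat.size - n_keep
    threshold = np.partition(flat, cut)[cut]
    return np.where(acts >= threshold, acts, 0.0)


def latent_scale(latent_acts: np.ndarray,
                 decoder_dir: np.ndarray,
                 targets: np.ndarray) -> float:
    """Closed-form least-squares scale for a single latent (an assumption):

        beta = argmin_b  sum_i || b * f_i * d - y_i ||^2

    latent_acts: (n_samples,)          activations f_i of the latent
    decoder_dir: (d_model,)            decoder direction d of the latent
    targets:     (n_samples, d_model)  activations y_i of one model
    """
    projections = targets @ decoder_dir
    numerator = float(np.sum(latent_acts * projections))
    denominator = float(np.sum(latent_acts ** 2) * (decoder_dir @ decoder_dir))
    return numerator / denominator if denominator > 0.0 else 0.0


def presence_ratio(latent_acts: np.ndarray,
                   decoder_dir: np.ndarray,
                   base_targets: np.ndarray,
                   chat_targets: np.ndarray) -> float:
    """Hypothetical summary statistic: beta_base / beta_chat.

    A value near 0 suggests the latent explains little in the base model;
    a value near 1 suggests it is about equally present in both models.
    """
    beta_base = latent_scale(latent_acts, decoder_dir, base_targets)
    beta_chat = latent_scale(latent_acts, decoder_dir, chat_targets)
    return beta_base / beta_chat if beta_chat != 0.0 else float("nan")
```

Under these assumptions, a latent that an L1-trained crosscoder labels chat-only would be flagged as a likely artifact if its `presence_ratio` is far from zero, since the base model's activations would then also be partly explained by that latent's direction.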
