Sparse autoencoders (SAEs) have become a standard tool for mechanistic interpretability in autoregressive large language models (LLMs), enabling researchers to extract sparse, human-interpretable features and intervene on model behavior. As diffusion language models (DLMs) emerge as an increasingly promising alternative to autoregressive LLMs, it is essential to develop mechanistic interpretability tools tailored to this class of models. In this work, we present DLM-Scope, the first SAE-based interpretability framework for DLMs, and demonstrate that trained Top-K SAEs can faithfully extract interpretable features. Notably, SAE insertion affects DLMs differently than autoregressive LLMs: whereas inserting an SAE into an LLM typically incurs a loss penalty, inserting one into an early DLM layer can reduce cross-entropy loss, a phenomenon absent or markedly weaker in LLMs. Additionally, SAE features in DLMs enable more effective diffusion-time interventions, often outperforming LLM steering. Moreover, we pioneer new SAE-based research directions for DLMs: we show that SAEs can provide useful signals for DLM decoding order, and that SAE features remain stable through the post-training phase of DLMs. Our work establishes a foundation for mechanistic interpretability in DLMs and demonstrates the strong potential of applying SAEs to DLM-related tasks and algorithms.
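For readers unfamiliar with the architecture the abstract names, the following is a minimal PyTorch sketch of a Top-K SAE: residual-stream activations are encoded into a wide dictionary, only the k largest latent activations per token are kept, and the input is reconstructed from those latents. All names (`TopKSAE`, `d_sae`, `k`), dimensions, and initialization choices are illustrative assumptions, not DLM-Scope's actual implementation.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal Top-K sparse autoencoder sketch (illustrative, not DLM-Scope's code).

    Encodes activations into a wide dictionary, keeps only the k largest
    latent activations per token, and reconstructs the input from them.
    """

    def __init__(self, d_model: int, d_sae: int, k: int):
        super().__init__()
        self.k = k
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_enc = nn.Parameter(torch.empty(d_model, d_sae))
        self.W_dec = nn.Parameter(torch.empty(d_sae, d_model))
        nn.init.kaiming_uniform_(self.W_enc)
        # Initialize the decoder as the encoder transpose, a common SAE choice.
        with torch.no_grad():
            self.W_dec.copy_(self.W_enc.t())

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        # Pre-activations over the dictionary.
        pre = (x - self.b_dec) @ self.W_enc + self.b_enc
        # Keep the k largest activations per token; zero out the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        z = torch.zeros_like(pre)
        z.scatter_(-1, topk.indices, torch.relu(topk.values))
        return z

    def forward(self, x: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        z = self.encode(x)
        x_hat = z @ self.W_dec + self.b_dec
        return x_hat, z

# Usage: train on cached model activations with an MSE reconstruction loss.
sae = TopKSAE(d_model=2048, d_sae=2048 * 16, k=64)
x = torch.randn(8, 128, 2048)          # (batch, tokens, d_model)
x_hat, z = sae(x)
loss = ((x_hat - x) ** 2).mean()
```

The Top-K constraint enforces sparsity directly, which is why no L1 penalty appears in the loss; the same reconstruction objective applies whether the activations come from an autoregressive LLM or a DLM.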