This paper introduces an efficient and robust method for discovering interpretable circuits in large language models using discrete sparse autoencoders. Our approach addresses two key limitations of existing techniques: computational complexity and sensitivity to hyperparameters. We propose training sparse autoencoders on carefully designed positive and negative examples, where the model can correctly predict the next token only for the positive examples. We hypothesise that the learned representations of attention head outputs will signal when a head is engaged in a specific computation. By discretising the learned representations into integer codes and measuring, for each head, the overlap between codes unique to positive examples, we can directly identify the attention heads involved in a circuit without expensive ablations or architectural modifications. On three well-studied tasks (indirect object identification, greater-than comparisons, and docstring completion), the proposed method achieves higher precision and recall in recovering ground-truth circuits than state-of-the-art baselines, while reducing runtime from hours to seconds. Notably, we require only 5-10 text examples for each task to learn robust representations. Our findings highlight the promise of discrete sparse autoencoders for scalable and efficient mechanistic interpretability, offering a new direction for analysing the inner workings of large language models.
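The discretise-and-compare step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how codes are produced or how the overlap is scored, so everything below is an assumption made for illustration: codes are taken as the index of the largest latent activation, a head's score is the fraction of positive examples whose code never appears among the negative examples, and the function names and the 0.5 threshold are hypothetical. The autoencoder training itself is omitted; the latents are assumed to be already computed.

```python
import numpy as np


def discretise(latents: np.ndarray) -> np.ndarray:
    """Turn latent vectors into integer codes (assumed rule: argmax).

    latents: shape (n_examples, n_heads, n_latents)
    returns: shape (n_examples, n_heads), one integer code per example and head
    """
    return latents.argmax(axis=-1)


def head_scores(pos_codes: np.ndarray, neg_codes: np.ndarray) -> np.ndarray:
    """Score each head by how often its positive-example codes are unique.

    A code counts as unique if it never occurs for that head on any
    negative example. This scoring rule is an illustrative assumption.
    """
    n_heads = pos_codes.shape[1]
    scores = np.zeros(n_heads)
    for h in range(n_heads):
        neg_set = set(neg_codes[:, h].tolist())
        unique_hits = [c not in neg_set for c in pos_codes[:, h].tolist()]
        scores[h] = np.mean(unique_hits)
    return scores


def select_heads(scores: np.ndarray, threshold: float = 0.5) -> list[int]:
    """Return indices of heads whose score reaches the (hypothetical) threshold."""
    return [int(h) for h in np.flatnonzero(scores >= threshold)]


# Toy codes for 3 positive and 3 negative examples across 2 heads.
pos_codes = np.array([[3, 1], [3, 2], [3, 1]])
neg_codes = np.array([[0, 1], [5, 2], [0, 1]])

scores = head_scores(pos_codes, neg_codes)
candidate_heads = select_heads(scores)
```

In this toy input, head 0 always emits a code on positive examples that it never emits on negative ones, so it is flagged as a circuit candidate, while head 1 emits the same codes in both conditions and is not. No ablation or forward-pass intervention is involved, which is the property the abstract credits for the runtime reduction.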