While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations, called circuits, that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of digits, number words, and months. Using circuit analysis techniques, we identify key sub-circuits responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Overall, documenting shared computational structures enables better prediction of model behaviors, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.
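To make the three sequence continuation tasks concrete, the following is a minimal sketch of how prompts and expected next members might be generated for digits, number words, and months. The window length, the space-separated prompt format, and the names `make_prompts` and `TASKS` are illustrative assumptions, not the setup used in this work.

```python
# Hypothetical prompt construction for the three task families.
# The sequence range (1-12) and window length are assumptions for illustration.

DIGITS = [str(n) for n in range(1, 13)]
NUMBER_WORDS = [
    "one", "two", "three", "four", "five", "six",
    "seven", "eight", "nine", "ten", "eleven", "twelve",
]
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]

TASKS = {"digits": DIGITS, "number_words": NUMBER_WORDS, "months": MONTHS}


def make_prompts(members, length=4):
    """Return (prompt, expected_next) pairs, one per window of `length`
    consecutive members that still has a successor in the sequence."""
    pairs = []
    for start in range(len(members) - length):
        window = members[start:start + length]
        pairs.append((" ".join(window), members[start + length]))
    return pairs


# One prompt set per task, aligned by position so that the same index
# refers to the same abstract sequence step in every task.
prompts = {name: make_prompts(seq) for name, seq in TASKS.items()}

for name, pairs in prompts.items():
    prompt, target = pairs[0]
    print(f"{name}: {prompt!r} -> {target!r}")
```

Because the three prompt sets are aligned by position, the same index in each set corresponds to the same abstract step (for example, the fourth-to-fifth member), which is the kind of pairing a cross-task circuit comparison would need.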