While transformer models exhibit strong capabilities on linguistic tasks, their complex architectures make them difficult to interpret. Recent work has aimed to reverse engineer transformer models into human-readable representations called circuits that implement algorithmic functions. We extend this research by analyzing and comparing circuits for similar sequence continuation tasks, which include increasing sequences of digits, number words, and months. Through the application of circuit analysis techniques, we identify key sub-circuits responsible for detecting sequence members and for predicting the next member in a sequence. Our analysis reveals that semantically related sequences rely on shared circuit subgraphs with analogous roles. Overall, documenting shared computational structures enables better prediction of model behaviors, identification of errors, and safer editing procedures. This mechanistic understanding of transformers is a critical step towards building more robust, aligned, and interpretable language models.