Recent research in mechanistic interpretability has attempted to reverse-engineer Transformer models by carefully inspecting network weights and activations. However, these approaches require considerable manual effort and still fall short of providing complete, faithful descriptions of the underlying algorithms. In this work, we introduce a procedure for training Transformers that are mechanistically interpretable by design. We build on RASP [Weiss et al., 2021], a programming language that can be compiled into Transformer weights. Instead of compiling human-written programs into Transformers, we design a modified Transformer that can be trained using gradient-based optimization and then automatically converted into a discrete, human-readable program. We refer to these models as Transformer Programs. To validate our approach, we learn Transformer Programs for a variety of problems, including an in-context learning task, a suite of algorithmic problems (e.g., sorting and recognizing Dyck languages), and NLP tasks including named entity recognition and text classification. The Transformer Programs can automatically find reasonable solutions, performing on par with standard Transformers of comparable size, and, more importantly, they are easy to interpret. To demonstrate these advantages, we convert Transformers into Python programs and use off-the-shelf code analysis tools to debug model errors and identify the "circuits" used to solve different sub-problems. We hope that Transformer Programs open a new path toward the goal of intrinsically interpretable machine learning.
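To give a rough sense of what a discrete, human-readable attention head might look like, the following is a minimal Python sketch of RASP-style `select` and `aggregate` operations. It is an illustration under stated assumptions, not the code generated by the method described above: the helper names `aggregate_first` and `previous_token`, the first-match tie-breaking rule, and the `<s>` default value are all hypothetical choices made for this example.

```python
from typing import Callable, List, Sequence, TypeVar

K = TypeVar("K")
Q = TypeVar("Q")
V = TypeVar("V")


def select(
    keys: Sequence[K], queries: Sequence[Q], predicate: Callable[[K, Q], bool]
) -> List[List[bool]]:
    """Build a boolean attention pattern.

    pattern[q][k] is True when query position q attends to key position k.
    """
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate_first(
    pattern: Sequence[Sequence[bool]], values: Sequence[V], default: V
) -> List[V]:
    """Hard attention (assumed rule): each query copies the value at the
    first key it selects, or `default` when it selects no key."""
    out: List[V] = []
    for row in pattern:
        matches = [v for attend, v in zip(row, values) if attend]
        out.append(matches[0] if matches else default)
    return out


def previous_token(tokens: Sequence[str], bos: str = "<s>") -> List[str]:
    """Hypothetical 'head': every position reads the token one step back."""
    positions = list(range(len(tokens)))
    pattern = select(positions, positions, lambda k, q: k == q - 1)
    return aggregate_first(pattern, tokens, default=bos)


if __name__ == "__main__":
    print(previous_token(["a", "b", "c"]))
```

Because the attention rule is an explicit predicate over discrete values, such a head can be read, stepped through in a debugger, or checked with ordinary code analysis tools, which is the kind of inspection the abstract refers to.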