Interpretability research aims to build tools for understanding machine learning (ML) models. However, such tools are inherently hard to evaluate because we do not have ground truth information about how ML models actually work. In this work, we propose to build transformer models manually as a testbed for interpretability research. We introduce Tracr, a "compiler" for translating human-readable programs into weights of a transformer model. Tracr takes code written in RASP, a domain-specific language (Weiss et al. 2021), and translates it into weights for a standard, decoder-only, GPT-like transformer architecture. We use Tracr to create a range of ground truth transformers that implement programs including computing token frequencies, sorting, and Dyck-n parenthesis checking, among others. To enable the broader research community to explore and use compiled models, we provide an open-source implementation of Tracr at https://github.com/deepmind/tracr.
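As a rough illustration of the kind of program mentioned above, the following sketch imitates two RASP-style primitives in plain Python and uses them to compute token frequencies. It does not use the Tracr library or its API, and it does not produce transformer weights; the names `select`, `selector_width`, `aggregate`, and `token_frequencies` are chosen here for illustration and only mirror the concepts described by Weiss et al. (2021).

```python
from typing import Callable, List, Sequence

# A selector is a boolean matrix: selector[q][k] is True when the query
# position q "attends to" the key position k.
Selector = List[List[bool]]


def select(
    keys: Sequence,
    queries: Sequence,
    predicate: Callable[[object, object], bool],
) -> Selector:
    """Compare every query position with every key position."""
    return [[predicate(k, q) for k in keys] for q in queries]


def selector_width(selector: Selector) -> List[int]:
    """Count how many key positions each query position selects."""
    return [sum(row) for row in selector]


def aggregate(selector: Selector, values: Sequence[float]) -> List[float]:
    """Average the selected values at each query position (0.0 if none)."""
    out = []
    for row in selector:
        chosen = [v for keep, v in zip(row, values) if keep]
        out.append(sum(chosen) / len(chosen) if chosen else 0.0)
    return out


def token_frequencies(tokens: Sequence[str]) -> List[int]:
    """For each position, count occurrences of its token in the sequence."""
    same_token = select(tokens, tokens, lambda k, q: k == q)
    return selector_width(same_token)


if __name__ == "__main__":
    print(token_frequencies(list("hello")))
```

In this sketch the selector plays the role of an attention pattern and the width or aggregate step plays the role of reading from it, which is the correspondence a compiler from RASP to transformer weights relies on. The plain-Python version is only meant to make the program semantics concrete.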