Mechanistic interpretability methods aim to identify the algorithm a neural network implements, but it is difficult to validate such methods when the true algorithm is unknown. This work presents InterpBench, a collection of semi-synthetic yet realistic transformers with known circuits for evaluating these techniques. We train these neural networks using a stricter version of Interchange Intervention Training (IIT) which we call Strict IIT (SIIT). Like the original, SIIT trains neural networks by aligning their internal computation with a desired high-level causal model, but it also prevents non-circuit nodes from affecting the model's output. We evaluate SIIT on sparse transformers produced by the Tracr tool and find that SIIT models maintain Tracr's original circuit while being more realistic. SIIT can also train transformers with larger circuits, like Indirect Object Identification (IOI). Finally, we use our benchmark to evaluate existing circuit discovery techniques.
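To make the two training objectives concrete, here is a minimal toy sketch (not the paper's code; all names are hypothetical) of an interchange intervention: a high-level causal model and a low-level model share one aligned intermediate variable, patching that variable from a source run should change both outputs identically (the IIT condition), while patching a non-circuit node should leave the low-level output unchanged (the additional SIIT condition).

```python
def high_level(a, b, patched_sum=None):
    """High-level causal model: out = 2 * (a + b)."""
    s = a + b if patched_sum is None else patched_sum
    return 2 * s

def low_level(a, b, patched_sum=None, patched_noise=None):
    """Toy low-level 'network' with an aligned sum node and a
    non-circuit noise node that must not affect the output."""
    s = a + b if patched_sum is None else patched_sum
    noise = 7 if patched_noise is None else patched_noise  # non-circuit node
    _ = noise  # SIIT requires the output to be independent of this node
    return 2 * s

base, source = (1, 2), (10, 20)

# IIT check: patch the aligned node with its value from the source run;
# the high-level and low-level models must agree under the intervention.
hl_patched = high_level(*base, patched_sum=sum(source))
ll_patched = low_level(*base, patched_sum=sum(source))
print(hl_patched == ll_patched)

# SIIT check: intervening on a non-circuit node must not change the output.
print(low_level(*base) == low_level(*base, patched_noise=-1))
```

In actual SIIT training these checks become loss terms: disagreement with the high-level model under aligned-node interventions, plus any output change under non-circuit-node interventions, is penalized.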