In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally lower bounding the accuracy of 151 small transformers trained on a Max-of-$k$ task. We create 102 different computer-assisted proof strategies and assess their length and tightness of bound on each of our models. Using quantitative metrics, we show that shorter proofs seem to require and provide more mechanistic understanding, and that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs on model performance.
翻译:在本研究中,我们提出利用机制可解释性——一种将模型权重逆向工程转化为人类可解释算法的技术——来推导并简洁地证明模型性能的形式化保证。我们通过在Max-of-$k$任务上训练的151个小型Transformer模型上形式化地界定其准确率下界,对该方法进行了原型验证。我们创建了102种不同的计算机辅助证明策略,并评估了它们在每个模型上的证明长度与边界紧密度。通过量化指标分析,我们证明较短的证明似乎需要并提供更深入的机制性理解,而更忠实的机制性理解能产生更严格的性能边界。我们通过对部分证明的定性分析验证了这些关联。最后,我们指出无结构噪声的复合效应是利用机制可解释性生成模型性能简洁证明所面临的关键挑战。