In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess each strategy's proof length and the tightness of its bound on each of our models. Using quantitative metrics, we find that shorter proofs seem to require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs of model performance.
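To make the setting concrete, below is a minimal Python sketch (not the paper's code) of the Max-of-$K$ task and the simplest proof strategy in this family: brute-force enumeration of all inputs, which certifies an exact accuracy and hence a valid lower bound, at cost exponential in $K$. The `model` stub, `VOCAB = 10`, and `K = 4` are illustrative assumptions; the paper's models are small trained transformers.

```python
# A minimal sketch (not the paper's code) of the Max-of-K setting and the
# brute-force proof strategy: enumerate every input and check the model,
# which certifies an exact accuracy (hence a valid lower bound).
# VOCAB, K, and the `model` stub are illustrative assumptions.
import itertools

VOCAB = 10  # hypothetical token vocabulary {0, ..., 9}
K = 4       # sequence length for Max-of-K

def model(seq):
    # Stand-in for a trained transformer's argmax prediction on `seq`.
    # This toy "model" is correct except on all-equal inputs.
    return max(seq) if len(set(seq)) > 1 else (max(seq) + 1) % VOCAB

def brute_force_accuracy():
    """Check every one of VOCAB**K inputs; the result is exact, so it is
    trivially a certified lower bound on accuracy."""
    correct = sum(
        model(seq) == max(seq)
        for seq in itertools.product(range(VOCAB), repeat=K)
    )
    return correct / VOCAB**K

if __name__ == "__main__":
    print(f"certified accuracy >= {brute_force_accuracy():.4f}")  # 0.9990
```

Brute force yields the tightest possible bound but the longest proof; the tradeoff the abstract describes is how much mechanistic understanding lets a proof strategy shorten this computation while keeping the resulting bound tight.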