In this work, we propose using mechanistic interpretability -- techniques for reverse engineering model weights into human-interpretable algorithms -- to derive and compactly prove formal guarantees on model performance. We prototype this approach by formally proving lower bounds on the accuracy of 151 small transformers trained on a Max-of-$K$ task. We create 102 different computer-assisted proof strategies and assess their length and the tightness of their bounds on each of our models. Using quantitative metrics, we find that shorter proofs seem to both require and provide more mechanistic understanding. Moreover, we find that more faithful mechanistic understanding leads to tighter performance bounds. We confirm these connections by qualitatively examining a subset of our proofs. Finally, we identify compounding structureless noise as a key challenge for using mechanistic interpretability to generate compact proofs of model performance.