We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that GPT-4 does not solve (GPT-4 itself solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. Unlike large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.
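To make the final stage of the pipeline concrete, the following is a minimal sketch of Boolean symbolic regression on a finite state machine. It is not the MIPS implementation: the transition table is written by hand rather than extracted from a trained RNN, and the names `transitions`, `CANDIDATES`, `boolean_regression`, and `run_fsm` are hypothetical. The search is a plain brute-force scan over a short, hand-picked list of formulas, chosen only to illustrate the idea of replacing a lookup table with a symbolic update rule.

```python
from itertools import product

# Toy finite state machine with a 1-bit state s and a 1-bit input x.
# Assumption: this table is hand-written (running parity); in a real
# pipeline it would be read off a discretized RNN.
transitions = {(s, x): s ^ x for s, x in product((0, 1), repeat=2)}

# Hypothetical candidate formulas, ordered roughly from simplest to most complex.
CANDIDATES = [
    ("0", lambda s, x: 0),
    ("1", lambda s, x: 1),
    ("s", lambda s, x: s),
    ("x", lambda s, x: x),
    ("not s", lambda s, x: 1 - s),
    ("not x", lambda s, x: 1 - x),
    ("s and x", lambda s, x: s & x),
    ("s or x", lambda s, x: s | x),
    ("s xor x", lambda s, x: s ^ x),
]


def boolean_regression(table):
    """Return (name, function) of the first candidate matching every table entry."""
    for name, fn in CANDIDATES:
        if all(fn(s, x) == out for (s, x), out in table.items()):
            return name, fn
    return None  # no candidate in the list explains the table


def run_fsm(step, bits, state=0):
    """Apply the recovered update rule to a sequence of input bits."""
    for b in bits:
        state = step(state, b)
    return state


found = boolean_regression(transitions)
if found is not None:
    name, step = found
    print("recovered update rule:", name)
    print("final state on [1, 0, 1, 1]:", run_fsm(step, [1, 0, 1, 1]))
```

Once a formula is found, the table is no longer needed: the recovered rule is ordinary Python and can be run on sequences of any length. A fixed candidate list only scales to tiny state spaces, so a realistic system would need a more systematic search over expressions.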