Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the ``shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
翻译:近年来,神经网络在辅助软件工程师编写程序甚至进行形式化验证方面展现出潜力。尽管语义信息在这些过程中起着关键作用,但流行的神经架构(如Transformers)能在多大程度上建模该类信息仍不明确。本文通过机械可解释性的视角,聚焦结构递归问题,研究学习与程序及形式化验证证明相关算法的神经网络行为。结构递归是符号工具目前优于神经模型的任务(如推断数据类型间的语义关系及模拟程序行为)的核心。我们评估了Transformers模型从输入输出样例中学习模拟结构递归函数行为的能力。评估涵盖了对Transformers模型逼近这些函数时局限性与能力的经验与概念分析,以及对模型所习得的“捷径”算法的重建。通过重建这些算法,我们能够正确预测其中一个被逼近函数91%的失败案例。本研究为理解神经网络在无法完成其训练任务时的行为提供了新基础。