Neural networks have in recent years shown promise for helping software engineers write programs and even formally verify them. While semantic information plays a crucial part in these processes, it remains unclear to what degree popular neural architectures like transformers are capable of modeling that information. This paper examines the behavior of neural networks learning algorithms relevant to programs and formal verification proofs through the lens of mechanistic interpretability, focusing in particular on structural recursion. Structural recursion is at the heart of tasks on which symbolic tools currently outperform neural models, like inferring semantic relations between datatypes and emulating program behavior. We evaluate the ability of transformer models to learn to emulate the behavior of structurally recursive functions from input-output examples. Our evaluation includes empirical and conceptual analyses of the limitations and capabilities of transformer models in approximating these functions, as well as reconstructions of the ``shortcut" algorithms the model learns. By reconstructing these algorithms, we are able to correctly predict 91 percent of failure cases for one of the approximated functions. Our work provides a new foundation for understanding the behavior of neural networks that fail to solve the very tasks they are trained for.
翻译:摘要:近年来,神经网络在辅助软件工程师编写程序甚至进行形式化验证方面展现出潜力。尽管语义信息在这些过程中起着关键作用,但流行的神经架构(如Transformer)在多大程度上能够建模这些信息仍不清楚。本文通过机械可解释性的视角,研究了神经网络学习与程序及形式化验证证明相关算法的行为,特别关注结构递归。结构递归是符号工具目前优于神经模型的任务(如推断数据类型间的语义关系和模拟程序行为)的核心。我们评估了Transformer模型从输入-输出示例中学习模拟结构递归函数行为的能力。我们的评估包括对Transformer模型在逼近这些函数时的局限性和能力的经验性及概念性分析,以及对模型所学习的“捷径”算法的重建。通过重建这些算法,我们能够正确预测其中一个近似函数91%的失败案例。我们的工作为理解神经网络无法解决其训练任务的这一行为提供了新的基础。