This paper tackles the challenge of teaching code semantics to Large Language Models (LLMs) for program analysis by incorporating code symmetries into the model architecture. We introduce a group-theoretic framework that defines code symmetries as semantics-preserving transformations; these symmetries form a code symmetry group, which enables precise and efficient reasoning about code semantics. Our solution, SymC, develops a novel variant of self-attention that is provably equivariant to code symmetries from the permutation group defined over the program dependence graph. SymC achieves superior performance on five program analysis tasks, outperforming state-of-the-art code models, including GPT-4, without any pre-training. Our results suggest that code LLMs that encode the structural prior of code via the code symmetry group generalize better and faster.
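To make the equivariance claim concrete, the following is a minimal sketch of a self-attention layer with an additive graph bias, checked against a permutation that preserves a toy dependence graph. It is an illustration only, not SymC's actual layer: the function names, the identity query/key/value projections, the `bias` parameter, and the four-statement graph are all assumptions introduced here.

```python
import math

def graph_biased_attention(X, A, bias=1.0):
    """Toy self-attention: identity projections plus an additive graph bias.

    X is a list of n feature vectors; A is an n x n 0/1 adjacency matrix of a
    hypothetical dependence graph. This is an assumed simplification.
    """
    n, d = len(X), len(X[0])
    out = []
    for i in range(n):
        scores = [
            sum(X[i][k] * X[j][k] for k in range(d)) / math.sqrt(d) + bias * A[i][j]
            for j in range(n)
        ]
        m = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(
            [sum(weights[j] / total * X[j][k] for j in range(n)) for k in range(d)]
        )
    return out

def permute(rows, p):
    """Reorder rows so that position i holds rows[p[i]]."""
    return [rows[p[i]] for i in range(len(p))]

def is_automorphism(A, p):
    """True if relabeling nodes by p leaves every edge of A in place."""
    n = len(A)
    return all(A[p[i]][p[j]] == A[i][j] for i in range(n) for j in range(n))

def is_equivariant(X, A, p, tol=1e-9):
    """Check f(permute(X)) == permute(f(X)) up to floating-point tolerance."""
    lhs = graph_biased_attention(permute(X, p), A)
    rhs = permute(graph_biased_attention(X, A), p)
    return all(abs(a - b) <= tol for r1, r2 in zip(lhs, rhs) for a, b in zip(r1, r2))

# Hypothetical four-statement program:
#   0: a = input(); 1: b = a + 1; 2: c = a * 2; 3: d = b + c
# Edges 0->1, 0->2, 1->3, 2->3 record which statement depends on which.
A = [
    [0, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
X = [[1.0, 0.0], [0.5, -1.0], [-0.3, 0.8], [0.2, 0.4]]

swap_independent = [0, 2, 1, 3]  # exchanges statements 1 and 2
swap_endpoints = [3, 1, 2, 0]    # exchanges statements 0 and 3

print(is_automorphism(A, swap_independent), is_equivariant(X, A, swap_independent))
print(is_automorphism(A, swap_endpoints))
```

Exchanging the two independent statements leaves the graph unchanged, so permuting the input and then applying the layer gives the same result as applying the layer and then permuting the output. Exchanging statements 0 and 3 breaks the dependence edges, so it is not a symmetry in this sense and no such guarantee applies.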