Large Language Models (LLMs) have shown promise in automated program reasoning, a crucial aspect of many security tasks. However, existing LLM architectures for code are often borrowed from other domains like natural language processing, raising concerns about their generalization and robustness to unseen code. A key generalization challenge is to incorporate the knowledge of code semantics, including control and data flow, into the LLM architectures. Drawing inspiration from examples of convolution layers exploiting translation symmetry, we explore how code symmetries can enhance LLM architectures for program analysis and modeling. We present a rigorous group-theoretic framework that formally defines code symmetries as semantics-preserving transformations and provides techniques for precisely reasoning about symmetry preservation within LLM architectures. Using this framework, we introduce a novel variant of self-attention that preserves program symmetries, demonstrating its effectiveness in generalization and robustness through detailed experimental evaluations across different binary and source code analysis tasks. Overall, our code symmetry framework offers rigorous and powerful reasoning techniques that can guide the future development of specialized LLMs for code and advance LLM-guided program reasoning tasks.
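To make the notion of symmetry preservation concrete, the following is a minimal sketch of how one might check that a layer commutes with a group action. It is not the attention variant introduced in this work: it uses standard scaled dot-product self-attention without positional encodings, and it takes token permutations as a stand-in symmetry group, whereas the code symmetries considered here are semantics-preserving program transformations. All names (`self_attention`, `rand_matrix`, `perm`) are hypothetical and chosen for illustration only.

```python
import math
import random

def matmul(A, B):
    # Plain list-of-lists matrix product: (n x k) @ (k x m) -> (n x m).
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    # Numerically stable softmax over a single row.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    # Standard scaled dot-product self-attention, no positional encoding.
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(K[0]))
    Kt = [list(col) for col in zip(*K)]
    scores = [[s / scale for s in row] for row in matmul(Q, Kt)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
n, d = 5, 4                      # 5 tokens, embedding width 4 (arbitrary sizes)
X = rand_matrix(n, d, rng)
Wq, Wk, Wv = (rand_matrix(d, d, rng) for _ in range(3))

# A group element g: here, a permutation of the token positions (an assumption).
perm = list(range(n))
rng.shuffle(perm)

# Equivariance check: f(g . x) should equal g . f(x).
out = self_attention(X, Wq, Wk, Wv)
out_then_perm = [out[i] for i in perm]
perm_then_out = self_attention([X[i] for i in perm], Wq, Wk, Wv)

max_diff = max(
    abs(a - b)
    for ra, rb in zip(out_then_perm, perm_then_out)
    for a, b in zip(ra, rb)
)
equivariant = max_diff < 1e-9
print("permutation-equivariant:", equivariant)
```

The same pattern (apply the transformation before the layer, apply it after, and compare) carries over to other symmetry groups, provided the group action on both inputs and outputs is defined.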