Large Language Models (LLMs) have shown promise in automated program reasoning, a crucial aspect of many security tasks. However, existing LLM architectures for code are often borrowed from other domains like natural language processing, raising concerns about their generalization and robustness to unseen code. A key generalization challenge is to incorporate the knowledge of code semantics, including control and data flow, into the LLM architectures. Drawing inspiration from examples of convolution layers exploiting translation symmetry, we explore how code symmetries can enhance LLM architectures for program analysis and modeling. We present a rigorous group-theoretic framework that formally defines code symmetries as semantics-preserving transformations and provides techniques for precisely reasoning about symmetry preservation within LLM architectures. Using this framework, we introduce a novel variant of self-attention that preserves program symmetries, demonstrating its effectiveness in generalization and robustness through detailed experimental evaluations across different binary and source code analysis tasks. Overall, our code symmetry framework offers rigorous and powerful reasoning techniques that can guide the future development of specialized LLMs for code and advance LLM-guided program reasoning tasks.
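To make the idea of symmetry preservation concrete, the following minimal sketch checks a well-known special case: plain single-head self-attention without positional encodings is equivariant to permutations of its input tokens. This is an illustration only, not the attention variant introduced in this work; the abstract does not specify that construction. The helper names (`self_attention`, `permute`, `is_equivariant`), the toy weights, and the choice of token permutations as the symmetry group are all assumptions made for the example.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, Wq, Wk, Wv):
    """Single-head self-attention with no positional encodings (hypothetical helper)."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    scale = math.sqrt(len(Q[0]))
    scores = [[sum(q * k for q, k in zip(qi, kj)) / scale for kj in K] for qi in Q]
    weights = [softmax(r) for r in scores]
    return matmul(weights, V)

def permute(rows, perm):
    """Apply a permutation (a group element) to the rows of a matrix."""
    return [rows[i] for i in perm]

def is_equivariant(X, perm, Wq, Wk, Wv, tol=1e-9):
    """Check f(g . X) == g . f(X) up to floating-point tolerance."""
    lhs = self_attention(permute(X, perm), Wq, Wk, Wv)
    rhs = permute(self_attention(X, Wq, Wk, Wv), perm)
    return all(abs(a - b) <= tol
               for r1, r2 in zip(lhs, rhs)
               for a, b in zip(r1, r2))

# Toy data: three token embeddings of width two, with arbitrary weights.
X = [[1.0, 0.0], [0.5, 2.0], [-1.0, 0.3]]
Wq = [[0.2, -0.1], [0.4, 0.3]]
Wk = [[0.1, 0.5], [-0.3, 0.2]]
Wv = [[1.0, 0.0], [0.0, 1.0]]

print(is_equivariant(X, [2, 0, 1], Wq, Wk, Wv))
```

Permuting the input rows permutes the queries, keys, and values in the same way, so each output row is the same weighted sum evaluated at a relabeled position. A symmetry-preserving architecture for code would need the analogous property for semantics-preserving program transformations rather than for arbitrary token reorderings.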