Large Language Models (LLMs) have shown promise in automated program reasoning, a crucial aspect of many security tasks. However, existing LLM architectures for code are often borrowed from other domains like natural language processing, raising concerns about their generalization and robustness to unseen code. A key generalization challenge is to incorporate the knowledge of code semantics, including control and data flow, into the LLM architectures. Drawing inspiration from examples of convolution layers exploiting translation symmetry, we explore how code symmetries can enhance LLM architectures for program analysis and modeling. We present a rigorous group-theoretic framework that formally defines code symmetries as semantics-preserving transformations and provides techniques for precisely reasoning about symmetry preservation within LLM architectures. Using this framework, we introduce a novel variant of self-attention that preserves program symmetries, demonstrating its effectiveness in generalization and robustness through detailed experimental evaluations across different binary and source code analysis tasks. Overall, our code symmetry framework offers rigorous and powerful reasoning techniques that can guide the future development of specialized LLMs for code and advance LLM-guided program reasoning tasks.
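As a rough illustration of what it means for an attention layer to preserve a symmetry, the sketch below checks a well-known property: plain self-attention without positional encodings is permutation-equivariant, so permuting the input tokens permutes the output rows in the same way. This is not the symmetry-preserving attention variant introduced in this work. The identity query/key/value projections, the toy embeddings `x`, and the permutation `perm` are all assumptions chosen to keep the example small.

```python
import math

def softmax(row):
    # Numerically stable softmax over one row of attention scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(x):
    # Single-head self-attention with identity Q/K/V projections
    # and no positional encoding (both are simplifying assumptions).
    d = len(x[0])
    scores = [
        [sum(a * b for a, b in zip(xi, xj)) / math.sqrt(d) for xj in x]
        for xi in x
    ]
    weights = [softmax(r) for r in scores]
    return [
        [sum(w * xj[k] for w, xj in zip(wr, x)) for k in range(d)]
        for wr in weights
    ]

def permute(rows, perm):
    # Reorder rows so that new row i is old row perm[i].
    return [rows[i] for i in perm]

# Toy token embeddings (hypothetical values) and a permutation of the tokens.
x = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
perm = [2, 0, 1]

attend_then_permute = permute(self_attention(x), perm)
permute_then_attend = self_attention(permute(x, perm))

# Equivariance: both orders agree up to floating-point rounding.
is_equivariant = all(
    abs(u - v) < 1e-9
    for row_a, row_b in zip(attend_then_permute, permute_then_attend)
    for u, v in zip(row_a, row_b)
)
print(is_equivariant)
```

The comparison uses a tolerance because permuting the inputs changes the order of floating-point summation. A code-specific symmetry group would restrict which permutations count as semantics-preserving; this sketch does not model that restriction.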