Imandra CodeLogician: Neuro-Symbolic Reasoning for Precise Analysis of Software Logic

Large Language Models (LLMs) have shown strong performance on code understanding tasks, yet they fundamentally lack the ability to perform precise, exhaustive mathematical reasoning about program behavior. Existing benchmarks either focus on mathematical proof automation, largely disconnected from real-world software, or on engineering tasks that do not require semantic rigor. We present CodeLogician, a neurosymbolic agent for precise analysis of software logic, integrated with ImandraX, an industrial automated reasoning engine deployed in financial markets and safety-critical systems. Unlike prior approaches that use formal methods primarily to validate LLM outputs, CodeLogician uses LLMs to construct explicit formal models of software systems, enabling automated reasoning to answer rich semantic questions beyond binary verification outcomes. To rigorously evaluate mathematical reasoning about software logic, we introduce code-logic-bench, a benchmark targeting the middle ground between theorem proving and software engineering benchmarks. It measures reasoning correctness about program state spaces, control flow, coverage constraints, and edge cases, with ground truth defined via formal modeling and region decomposition. Comparing LLM-only reasoning against LLMs augmented with CodeLogician, formal augmentation yields substantial improvements, closing a 41-47 percentage point gap in reasoning accuracy. These results demonstrate that neurosymbolic integration is essential for scaling program analysis toward rigorous, autonomous software understanding.

翻译：大型语言模型（LLM）在代码理解任务上表现出强大的性能，但其本质上缺乏对程序行为进行精确、穷尽的数学推理的能力。现有基准测试要么侧重于与现实软件基本脱节的数学证明自动化，要么侧重于不需要语义严格性的工程任务。本文提出 CodeLogician，一种用于软件逻辑精确分析的神经符号智能体，其与 ImandraX（一个部署于金融市场和安全关键系统的工业级自动推理引擎）集成。与先前主要使用形式化方法来验证 LLM 输出的方法不同，CodeLogician 利用 LLM 构建软件系统的显式形式化模型，使得自动推理能够回答超越二元验证结果的丰富语义问题。为了严格评估关于软件逻辑的数学推理能力，我们引入了 code-logic-bench，这是一个定位于定理证明与软件工程基准测试中间地带的基准。它衡量关于程序状态空间、控制流、覆盖约束和边界情况的推理正确性，其真值通过形式化建模和区域分解定义。将纯 LLM 推理与通过 CodeLogician 增强的 LLM 进行对比，形式化增强带来了显著改进，将推理准确性的差距缩小了 41-47 个百分点。这些结果表明，神经符号集成对于将程序分析扩展至严格、自主的软件理解至关重要。