FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

翻译：LLM辅助的软件开发日益普遍，可生成编译器这样的大型系统。加强生成代码的正确性变得至关重要。然而，由于代码复杂度，大型系统的自动化推理仍具挑战性。霍尔逻辑提供了一种将大型系统分解为更小组件并分别推理（即组合推理）的方法。然而现有方法仍难以扩展，因为霍尔逻辑需要为每个函数编写形式规约，带来沉重的人力负担。当代码由LLM生成时，问题更加严重——开发者缺乏对每个函数预期行为的深入理解。本文提出FM-Agent，首个实现大型系统自动化组合推理的框架。FM-Agent利用LLM，引入自顶向下范式自动生成函数级规约。具体而言，FM-Agent通过分析调用者期望函数的行为来推导函数规约，因此即使实现存在缺陷，生成的规约也能反映开发者的函数意图。开发者意图通常以自然语言表达，而现有验证器仅支持公式。为此FM-Agent将霍尔风格推理泛化，使函数能针对自然语言规约进行推理。最后，为确认缺陷存在并解释缺陷成因，FM-Agent自动生成测试用例触发潜在缺陷。实验评估中，FM-Agent在2天内成功完成了对代码量高达14.3万行的大型系统的推理。这些系统已由开发者进行过测试，但FM-Agent仍发现了522个新缺陷，这些缺陷可导致系统崩溃和错误执行结果等严重后果。