We propose a novel framework that provides constructive feedback to an LLM in the "guess-and-check" paradigm by formally verifying its own thinking process and detecting local reasoning errors. We apply this framework to the loop invariant synthesis problem. We prompt the model to produce a step-by-step natural language proof justifying its thinking process for the failed verification condition of its generated loop invariants. Then, we use an LLM to translate the reasoning steps into first-order logic implications, which can be checked automatically. An invalid implication pinpoints the exact logical flaw in the LLM's thinking process, which we then use to construct targeted feedback for refinement. We have implemented our approach in a tool called LORIS and evaluated it on a main benchmark suite of 460 C programs and an additional benchmark suite of 50 C programs each of which involves non-linear properties. On the main benchmark suite, LORIS solved 445 of the programs, and achieved an overall success rate of $93.1\%$. LORIS also demonstrates robustness on the challenging non-linear benchmark suite.
翻译:我们提出了一种新颖框架,在“猜测-验证”范式中通过形式化验证LLM自身的思考过程并检测局部推理错误,为其提供建设性反馈。我们将该框架应用于循环不变式综合问题。当LLM生成的循环不变式验证条件失败时,我们提示模型生成逐步的自然语言证明以阐释其思考过程。随后,利用LLM将这些推理步骤转化为一阶逻辑蕴含式,并由自动化工具进行验证。若蕴含式无效,即可精确定位LLM思维过程中的逻辑缺陷,我们据此构建针对性反馈以优化生成结果。该方案已在LORIS工具中实现,并在包含460个C程序的主基准测试集及额外50个涉及非线性属性的C程序基准集上完成评估。主基准测试中,LORIS成功求解445个程序,总体成功率达$93.1\%$。在处理挑战性的非线性基准集时,LORIS同样展现出稳健性。