We present Isabellm, an LLM-powered theorem prover for Isabelle/HOL that performs fully automatic proof synthesis. Isabellm works with any local LLM on Ollama and APIs such as Gemini CLI, and it is designed to run on consumer grade computers. The system combines a stepwise prover, which uses large language models to propose proof commands validated by Isabelle in a bounded search loop, with a higher-level proof planner that generates structured Isar outlines and attempts to fill and repair remaining gaps. The framework includes beam search for tactics, tactics reranker ML and RL models, premise selection with small transformer models, micro-RAG for Isar proofs built from AFP, and counter-example guided proof repair. All the code is implemented by GPT 4.1 - 5.2, Gemini 3 Pro, and Claude 4.5. Empirically, Isabellm can prove certain lemmas that defeat Isabelle's standard automation, including Sledgehammer, demonstrating the practical value of LLM-guided proof search. At the same time, we find that even state-of-the-art LLMs, such as GPT 5.2 Extended Thinking and Gemini 3 Pro struggle to reliably implement the intended fill-and-repair mechanisms with complex algorithmic designs, highlighting fundamental challenges in LLM code generation and reasoning. The code of Isabellm is available at https://github.com/zhehou/llm-isabelle
翻译:我们提出了Isabellm,一种用于Isabelle/HOL的LLM驱动定理证明器,能够执行全自动的证明合成。Isabellm可与Ollama上的任何本地LLM以及Gemini CLI等API协同工作,并设计为可在消费级计算机上运行。该系统结合了一个逐步证明器(它使用大语言模型在有限搜索循环中提出由Isabelle验证的证明命令)和一个更高级别的证明规划器(该规划器生成结构化的Isar大纲,并尝试填补和修复剩余的缺口)。该框架包括用于策略的束搜索、策略重排器ML和RL模型、基于小型Transformer模型的前提选择、从AFP构建的Isar证明的微RAG,以及反例引导的证明修复。所有代码均由GPT 4.1 - 5.2、Gemini 3 Pro和Claude 4.5实现。实验表明,Isabellm能够证明一些Isabelle标准自动化工具(包括Sledgehammer)无法处理的引理,这证明了LLM引导的证明搜索具有实用价值。同时,我们发现即使是像GPT 5.2 Extended Thinking和Gemini 3 Pro这样的最先进LLM,也难以可靠地实现具有复杂算法设计的预期填补与修复机制,这突显了LLM代码生成与推理方面的根本性挑战。Isabellm的代码可在https://github.com/zhehou/llm-isabelle获取。