Lemmanaid: Neuro-Symbolic Lemma Conjecturing

Mathematicians and computer scientists are increasingly using proof assistants to formalize and check correctness of complex proofs. This is a non-trivial task in itself, however, with high demands on human expertise. Can we lower the bar by introducing automation for conjecturing helpful, interesting and novel lemmas? We present the first neuro-symbolic lemma conjecturing tool, LEMMANAID, designed to discover conjectures by drawing analogies between mathematical theories. LEMMANAID uses a fine-tuned LLM to generate lemma templates that describe the shape of a lemma, and symbolic methods to fill in the details. We compare LEMMANAID against the same LLM fine-tuned to generate complete lemma statements (a purely neural method), as well as a fully symbolic conjecturing method. LEMMANAID consistently outperforms both neural and symbolic methods on test sets from Isabelle's HOL library and from its Archive of Formal Proofs (AFP). Using DeepSeek-coder-6.7B as a backend, LEMMANAID discovers 50% (HOL) and 28% (AFP) of the gold standard reference lemmas, 8-13% more than the corresponding neural-only method. Ensembling two LEMMANAID versions with different prompting strategies further increases performance to 55% and 34% respectively. In a case study on the formalization of Octonions, LEMMANAID discovers 79% of the gold standard lemmas, compared to 62% for neural-only and 23% for the state of the art symbolic tool. Our result show that LEMMANAID is able to conjecture a significant number of interesting lemmas across a wide range of domains covering formalizations over complex concepts in both mathematics and computer science, going far beyond the basic concepts of standard benchmarks such as miniF2F, PutnamBench and ProofNet.

翻译：数学家和计算机科学家正日益使用证明辅助工具来形式化并验证复杂证明的正确性。然而，这本身是一项非平凡的任务，对人类专业知识的要求很高。我们能否通过引入自动化机制来猜想有益、有趣且新颖的引理，从而降低这一门槛？本文提出了首个神经符号引理猜想工具 LEMMANAID，其设计目标是通过类比不同数学理论来发现猜想。LEMMANAID 使用经过微调的大语言模型生成描述引理结构的模板，并采用符号化方法填充具体细节。我们将 LEMMANAID 与经过相同微调以生成完整引理陈述的纯神经方法，以及完全符号化的猜想方法进行了比较。在 Isabelle 的 HOL 库及其形式化证明档案库的测试集上，LEMMANAID 始终优于神经方法和符号方法。以 DeepSeek-coder-6.7B 为后端时，LEMMANAID 发现了 50%（HOL）和 28%（AFP）的黄金标准参考引理，比对应的纯神经方法高出 8-13%。通过集成采用不同提示策略的两个 LEMMANAID 版本，性能进一步提升至 55% 和 34%。在八元数形式化的案例研究中，LEMMANAID 发现了 79% 的黄金标准引理，而纯神经方法和当前最先进的符号化工具分别仅发现 62% 和 23%。我们的结果表明，LEMMANAID 能够在涵盖数学和计算机科学中复杂概念形式化的广泛领域中，猜想出大量具有意义的引理，其能力远超 miniF2F、PutnamBench 和 ProofNet 等标准基准测试所涉及的基础概念。