We propose a minimal agentic baseline that enables systematic comparison across different AI-based theorem prover architectures. This design implements the core features shared among state-of-the-art systems: iterative proof refinement, library search and context management. We evaluate our baseline using qualitatively different benchmarks and compare various popular models and design choices, and demonstrate competitive performance compared to state-of-the-art approaches, while using a significantly simpler architecture. Our results demonstrate consistent advantages of an iterative approach over multiple single-shot generations, especially in terms of sample efficiency and cost effectiveness. The implementation is released open-source as a candidate reference for future research and as an accessible prover for the community.
翻译:我们提出了一种最小化的智能体基线,使得能够系统性地比较不同基于人工智能的定理证明器架构。该设计实现了最先进系统共有的核心特性:迭代式证明精化、库搜索与上下文管理。我们使用性质不同的基准测试评估该基线,比较了多种流行模型与设计选择,并证明了其在采用显著简化架构的同时,仍能达到与最先进方法相竞争的性能。我们的结果表明,相较于多次单次生成,迭代式方法在样本效率与成本效益方面具有持续优势。该实现已作为开源项目发布,旨在为未来研究提供一个候选参考,并为社区提供一个易于使用的证明器。