While large language models (LLMs) have demonstrated impressive capabilities in formal theorem proving, current benchmarks fail to adequately measure library-grounded abstraction -- the ability to reason with the high-level interfaces and reusable structures that are central to modern mathematics and software engineering. We introduce LeanCat, a challenging benchmark comprising 100 fully formalized category-theory tasks in Lean. Unlike algebra or arithmetic, category theory serves as a rigorous stress test of structural, interface-level reasoning. Our evaluation reveals a severe abstraction gap: the best state-of-the-art model solves only 12.0% of tasks at pass@4, with performance collapsing from 55.0% on Easy tasks to 0.0% on High-difficulty tasks, highlighting a failure of compositional generalization. To address this gap, we evaluate LeanBridge, a retrieval-augmented agent built around a retrieve-generate-verify loop. LeanBridge achieves a peak success rate of 24.0% -- double that of the best static baseline. These results provide empirical evidence that iterative refinement and dynamic library retrieval are not merely optimizations but necessities for neuro-symbolic reasoning in abstract domains. LeanCat offers a compact, reusable testbed for tracking progress toward reliable, research-level formalization.
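To make the task style concrete, the following is a minimal sketch of the kind of interface-level goal LeanCat targets, written against Mathlib's `CategoryTheory` library. It is illustrative only, not a task drawn from the benchmark: it asks for a proof that the composite of two monomorphisms is a monomorphism, which must be discharged through the `Mono` interface rather than by element-level reasoning.

```lean
import Mathlib.CategoryTheory.EpiMono

open CategoryTheory

-- Illustrative only: a LeanCat-style goal stated through Mathlib's
-- `Mono` interface. The proof cancels each monomorphism in turn.
example {C : Type*} [Category C] {X Y Z : C}
    (f : X ⟶ Y) (g : Y ⟶ Z) [Mono f] [Mono g] : Mono (f ≫ g) := by
  constructor
  intro W a b h
  -- reassociate both sides: (a ≫ f) ≫ g = (b ≫ f) ≫ g
  simp only [← Category.assoc] at h
  -- cancel g, then f, using right-cancellability of monomorphisms
  exact (cancel_mono f).mp ((cancel_mono g).mp h)
```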
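The retrieve-generate-verify loop can be summarized in a short control-flow sketch. The Python below is a hypothetical reconstruction, not LeanBridge's actual implementation: `retrieve_generate_verify` and its three callables are stand-ins for a library retriever, an LLM proposal step, and the Lean checker, and the `max_rounds` budget of 4 is an assumption loosely mirroring the pass@4 setting.

```python
from typing import Callable, Optional

# Hypothetical sketch of a retrieve-generate-verify loop in the style of
# LeanBridge; the three callables are stand-ins, not the paper's API.
def retrieve_generate_verify(
    goal: str,
    retrieve: Callable[[str], list[str]],          # goal -> candidate lemmas
    generate: Callable[[str, list[str], Optional[str]], str],  # -> Lean proof
    verify: Callable[[str], tuple[bool, str]],     # proof -> (ok, Lean errors)
    max_rounds: int = 4,
) -> Optional[str]:
    """Alternate retrieval, generation, and checking until Lean accepts
    a proof or the round budget is exhausted."""
    feedback: Optional[str] = None
    for _ in range(max_rounds):
        lemmas = retrieve(goal)                    # dynamic library retrieval
        proof = generate(goal, lemmas, feedback)   # propose a candidate proof
        ok, feedback = verify(proof)               # symbolic check in Lean
        if ok:
            return proof                           # verified proof found
    return None                                    # budget exhausted
```

The design point the abstract argues for is visible in the loop structure: the verifier's error message flows back into the next generation call, so a failed attempt refines the prompt instead of terminating the episode.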