LongCat-Flash-Prover: Advancing Native Formal Reasoning via Agentic Tool-Integrated Reinforcement Learning

Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, Jingang Wang, Liulin Feng, Mengxia Shen, Qi Li, Shengnan An, Shun Wang, Wei Shi, Xiangyu Xi, Xiaoyu Li, Xuezhi Cao, Yi Lu, Yunke Zhao, Zhengyu Chen, Zhimin Lin, Wei Wang, Peng Pei, Xunliang Cai

From arXiv, 43 pages, 5 figures

We introduce LongCat-Flash-Prover, a flagship 560-billion-parameter open-source Mixture-of-Experts (MoE) model that advances native formal reasoning in Lean4 through agentic tool-integrated reasoning (TIR). We decompose the native formal reasoning task into three independent formal capabilities: auto-formalization, sketching, and proving. To support these capabilities, we propose a Hybrid-Experts Iteration Framework that expands high-quality task trajectories, covering the generation of a formal statement from a given informal problem and the production of either a whole proof or a lemma-style sketch directly from that statement. For agentic RL, we present Hierarchical Importance Sampling Policy Optimization (HisPO), an algorithm designed to stabilize MoE model training on such long-horizon tasks. HisPO employs a gradient masking strategy that accounts for policy staleness and the inherent train-inference engine discrepancies at both the sequence and token levels. We also incorporate theorem consistency and legality detection mechanisms to eliminate reward hacking. Extensive evaluations show that LongCat-Flash-Prover sets a new state of the art for open-weights models in both auto-formalization and theorem proving. Demonstrating remarkable sample efficiency, it achieves a 97.1% pass rate on MiniF2F-Test with an inference budget of only 72 per problem. On more challenging benchmarks, it solves 70.8% of ProverBench and 41.5% of PutnamBench with no more than 220 attempts per problem, significantly outperforming existing open-weights baselines.

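
To make the three capabilities named in the abstract concrete, the following Lean 4 sketch shows what each kind of output might look like for one toy problem. It is an illustration only, not taken from the paper: the informal problem, the theorem and lemma names, and the namespaces `Sketch` and `Proof` are all hypothetical, and the file assumes a project with Mathlib available.

```lean
import Mathlib

-- Hypothetical informal problem (not from the paper):
-- "The sum of two even natural numbers is even."

-- (1) Auto-formalization: the informal problem as a formal statement.
-- The proof is left open with `sorry`, so Lean accepts it only with a warning.
theorem sum_even_statement (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  sorry

namespace Sketch

-- (2) Sketching: the main theorem is reduced to a helper lemma
-- whose proof is deferred.
lemma regroup (m n : ℕ) : m + m + (n + n) = m + n + (m + n) := by
  sorry

theorem sum_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  -- `Even a` unfolds to `∃ r, a = r + r`; `rfl` substitutes `a := m + m`.
  obtain ⟨m, rfl⟩ := ha
  obtain ⟨n, rfl⟩ := hb
  exact ⟨m + n, regroup m n⟩

end Sketch

namespace Proof

-- (3) Proving: the same sketch with the deferred lemma discharged.
lemma regroup (m n : ℕ) : m + m + (n + n) = m + n + (m + n) := by
  omega

theorem sum_even (a b : ℕ) (ha : Even a) (hb : Even b) :
    Even (a + b) := by
  obtain ⟨m, rfl⟩ := ha
  obtain ⟨n, rfl⟩ := hb
  exact ⟨m + n, regroup m n⟩

end Proof
```

Only the `Proof` version is a complete proof. The statement and the sketch still contain `sorry`, which Lean reports as a warning rather than an error, so any automated check of a final proof has to look for it explicitly.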