Recent advances in large language models have demonstrated impressive capabilities in mathematical formalization. However, existing benchmarks focus on the logical verification of declarative propositions, often neglecting the task of explicitly synthesizing solutions. This limitation is particularly acute in applied mathematics, where the goal is frequently to derive concrete values or executable algorithms rather than solely to prove theorems. To address this, we introduce a Lean 4 framework that enforces a construction-verification workflow, compelling the agent to define explicit solutions before proving their correctness. We curate a comprehensive benchmark, AMBER (Applied Mathematics BEnchmark for Reasoning), spanning core domains of applied mathematics, including convex analysis, optimization, numerical algebra, and high-dimensional probability. Beyond theorem proving, our benchmark features complex tasks such as evaluation, algorithm design, and representation transformation. Experiments reveal that current models face significant difficulties with these constructive tasks. Notably, we observe that general-purpose reasoning models consistently outperform specialized theorem provers. We attribute this to a degradation of instruction-following capabilities in specialized models. Fine-tuning on proof corpora appears to induce ``tactical overfitting'', compromising the ability to adhere to complex constructive requirements, whereas general models retain the versatility needed for multi-task formal reasoning.
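To make the construction-verification workflow concrete, the following is a minimal Lean 4 sketch (hypothetical, not an actual benchmark item): the task ships a definition stub and a specification theorem, and the agent must first supply an explicit witness before the proof obligation can be discharged.

```lean
-- Construction step: the agent replaces a `sorry` placeholder with an
-- explicit value (here, a toy example).
def solution : Nat := 2

-- Verification step: prove the constructed value meets the specification.
theorem solution_spec : solution + 3 = 5 := by
  decide
```

Under this pattern, a proof of the specification alone is not accepted: the definition must be filled in explicitly, which is what distinguishes the constructive tasks from pure theorem proving.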