The ability to precisely derive mathematical objects is a core requirement for downstream STEM applications, including mathematics, physics, and chemistry, where reasoning must culminate in formally structured expressions. Yet current LM evaluations of mathematical and scientific reasoning rely heavily on simplified answer formats, such as numerical values or multiple-choice options, because these are convenient to assess automatically. In this paper, we make three contributions toward improving reasoning over mathematical objects: (i) we build and release training data and benchmarks for deriving mathematical objects, the Principia suite; (ii) we provide training recipes with strong LLM judges and verifiers, and show that on-policy judge training boosts performance; (iii) we show how on-policy training can also be used to scale test-time compute via aggregation. We find that strong LMs such as Qwen3-235B and o3 struggle on Principia, whereas our training recipes bring significant improvements across different LLM backbones and simultaneously improve results on existing numerical and MCQA tasks, demonstrating cross-format generalization of reasoning abilities.