We introduce Mathador-LM, a new benchmark for evaluating the mathematical-reasoning abilities of large language models (LLMs), combining ruleset interpretation, planning, and problem-solving. The benchmark is inspired by the Mathador game, where the objective is to reach a target number from a given set of base numbers using basic arithmetic operations, following a simple set of rules. We show that, across leading LLMs, average performance remains stable even when benchmark instances are generated dynamically at a target difficulty level. Our benchmark thus alleviates concerns about test-set leakage into training data, an issue that often undermines popular benchmarks. Additionally, we conduct a comprehensive evaluation of both open- and closed-source state-of-the-art LLMs on Mathador-LM. Our findings reveal that contemporary models struggle with Mathador-LM, scoring significantly below average 3rd graders. This stands in stark contrast to their strong performance on popular mathematical-reasoning benchmarks.
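To make the game underlying the benchmark concrete, below is a minimal Python sketch of a brute-force solver for a Mathador-style instance. The encoded rules (each base number used at most once, two numbers combined per step with +, -, *, or exact division, and intermediate results kept as non-negative integers), as well as the `solve` function and the example instance, are illustrative assumptions, not the paper's reference implementation or official ruleset.

```python
from typing import List, Optional

def solve(base: List[int], target: int) -> Optional[List[str]]:
    """Depth-first search for a sequence of arithmetic steps over `base`
    that reaches `target`. Each base number is used at most once; two
    numbers are combined per step, and intermediate results are kept as
    non-negative integers (assumed rules). Returns the steps, or None."""
    def search(nums: List[int], steps: List[str]) -> Optional[List[str]]:
        if target in nums:
            return steps
        for i in range(len(nums)):
            for j in range(len(nums)):
                if i == j:
                    continue
                a, b = nums[i], nums[j]
                rest = [nums[k] for k in range(len(nums)) if k not in (i, j)]
                # Candidate moves: always + and *; - only if non-negative,
                # / only if it divides exactly.
                moves = [(a + b, f"{a} + {b} = {a + b}"),
                         (a * b, f"{a} * {b} = {a * b}")]
                if a >= b:
                    moves.append((a - b, f"{a} - {b} = {a - b}"))
                if b != 0 and a % b == 0:
                    moves.append((a // b, f"{a} / {b} = {a // b}"))
                for value, step in moves:
                    found = search(rest + [value], steps + [step])
                    if found is not None:
                        return found
        return None

    return search(list(base), [])

# Hypothetical instance: reach 48 from the base numbers 3, 5, 7, 11, 13
# (one valid line of play: 7 - 5 = 2, 11 + 13 = 24, 24 * 2 = 48).
print(solve([3, 5, 7, 11, 13], 48))
```

An exhaustive search of this kind trivially solves such instances; the benchmark's difficulty for LLMs lies in interpreting the ruleset and planning a valid multi-step solution in natural language, not in raw search.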