Mamo: a Mathematical Modeling Benchmark with Solvers

Source: arXiv. Project: https://github.com/FreedomIntelligence/Mamo. Updates: (1) include more models; (2) minor modification of the metric, with new results; (3) fix some typos; (4) add error analysis with examples.

Mathematical modeling involves representing real-world phenomena, systems, or problems with mathematical expressions and equations in order to analyze, understand, and predict their behavior. Because this process typically requires experienced experts, there is interest in exploring whether Large Language Models (LLMs) can undertake mathematical modeling and thereby reduce human labor. To evaluate LLMs in mathematical modeling, we introduce a new benchmark, Mamo, that goes beyond traditional result-oriented assessments. Unlike conventional methods, which primarily assess LLMs on the accuracy of their solutions to mathematical problems, our approach offers deeper insight into the modeling process itself. By focusing on the processes LLMs undertake rather than the correctness of their final solutions, Mamo introduces a new evaluation paradigm. This shift underscores the importance of understanding the inherent modeling capabilities of LLMs and enables a more nuanced and comprehensive analysis of their problem-solving strategies. Our work suggests a new direction for future research by emphasizing the evaluation of LLMs' modeling processes over the mere correctness of answers. The benchmark not only facilitates a better understanding of LLMs' mathematical modeling capabilities but also sets a new standard for evaluating their performance in complex problem-solving scenarios.

