Mathematical modeling represents real-world phenomena, systems, or problems with mathematical expressions and equations in order to analyze, understand, and predict their behavior. Because this process typically requires experienced experts, there is interest in whether Large Language Models (LLMs) can carry out mathematical modeling and thereby reduce human labor. To evaluate LLMs in mathematical modeling, we introduce a new benchmark, Mamo, that goes beyond traditional result-oriented assessments. Unlike conventional methods, which primarily assess LLMs by the accuracy of their solutions to mathematical problems, our approach offers deeper insight into the modeling process itself. By focusing on the processes LLMs undertake rather than the correctness of their final solutions, Mamo introduces a new evaluation paradigm. This shift underscores the importance of understanding the inherent modeling capabilities of LLMs and enables a more nuanced and comprehensive analysis of their problem-solving strategies. Our work suggests a new direction for future research: evaluating LLMs' modeling processes rather than only the correctness of their answers. The benchmark facilitates a better understanding of LLMs' mathematical modeling capabilities and offers a standard for evaluating their performance in complex problem-solving scenarios.