Large language models (LLMs) excel at many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters on 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework that uses mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation, and examines the impact of regenerating output token by token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%), a disparity especially pronounced in complex areas such as Number Theory. While token-by-token regeneration only slightly improved accuracy (+0.8%) for llama3.1:8b, it reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also observed a consistent trend across all models: harder problems correlated with lower accuracy. Within controlled execution environments, less than 1% of the generated code proved unsafe, yet 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
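The code-as-reasoning pipeline summarized above — generate Python, execute it in a controlled environment, and retry up to 10 attempts — could be sketched roughly as follows. This is a minimal illustration, not the study's actual harness: the subprocess isolation stands in for the controlled execution environment, and the `generate` callable stands in for a model call.

```python
import os
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Execute model-generated Python in a separate process with a timeout.

    Returns (success, stdout_or_error). A real harness would add stricter
    sandboxing (resource limits, restricted imports); a subprocess with a
    timeout only approximates the controlled-execution idea.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        ok = proc.returncode == 0
        return ok, (proc.stdout if ok else proc.stderr).strip()
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)

def solve_with_retries(generate, max_attempts: int = 10):
    """Retry loop: regenerate code until it runs cleanly or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        ok, output = run_generated_code(generate(attempt))
        if ok:
            return output, attempt
    return None, max_attempts

# Demonstration with a stand-in "model" that fails once, then succeeds.
answer, tries = solve_with_retries(
    lambda n: "print(2**10)" if n > 1 else "raise ValueError('bad')"
)
```

In the study's setting, the extracted answer would then be passed to a judge model (mistral-large-2411 in the paper) for 5-point scoring, a step omitted here.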