Large language models (LLMs) excel at many natural language tasks, yet they struggle with complex mathematical problem-solving, particularly in symbolic reasoning and maintaining consistent output. This study evaluates 10 LLMs with 7 to 8 billion parameters on 945 competition-level problems from the MATH dataset. The focus is on their ability to generate executable Python code as a step in their reasoning process, involving over 9,450 code executions. The research introduces an evaluation framework that uses mistral-large-2411 to rate answers on a 5-point scale, which helps address inconsistencies in mathematical notation, and examines the impact of regenerating output token by token on refining results. The findings reveal a significant 34.5% performance gap between the top commercial model (gpt-4o-mini, scoring 83.7%) and the least effective open-source model (open-codestral-mamba:v0.1, scoring 49.2%), a disparity especially pronounced in complex areas such as Number Theory. While token-by-token regeneration only slightly improved accuracy (+0.8%) for llama3.1:8b, it reduced code execution time by 36.7%, highlighting a trade-off between efficiency and precision. The study also observed a consistent trend across all models: harder problems correlated with lower accuracy. Within controlled execution environments, less than 1% of the generated code proved unsafe, yet 3.17% of problems remained unsolved after 10 attempts, suggesting that hybrid reasoning methods may be beneficial.
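The code-as-reasoning pipeline summarized above — generate Python, execute it in a controlled environment, and retry up to 10 attempts — could be sketched roughly as follows. This is a minimal illustration, not the study's actual harness: the subprocess isolation stands in for the controlled execution environment, and the `generate` callable stands in for a model call.

```python
import os
import subprocess
import sys
import tempfile

def run_generated_code(code: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Execute model-generated Python in a separate process with a timeout.

    Returns (success, stdout_or_error). A real harness would add stricter
    sandboxing (resource limits, restricted imports); a subprocess with a
    timeout only approximates the controlled-execution idea.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True, text=True, timeout=timeout,
        )
        ok = proc.returncode == 0
        return ok, (proc.stdout if ok else proc.stderr).strip()
    except subprocess.TimeoutExpired:
        return False, "timeout"
    finally:
        os.unlink(path)

def solve_with_retries(generate, max_attempts: int = 10):
    """Retry loop: regenerate code until it runs cleanly or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        ok, output = run_generated_code(generate(attempt))
        if ok:
            return output, attempt
    return None, max_attempts

# Demonstration with a stand-in "model" that fails once, then succeeds.
answer, tries = solve_with_retries(
    lambda n: "print(2**10)" if n > 1 else "raise ValueError('bad')"
)
```

In the study's setting, the extracted answer would then be passed to a judge model (mistral-large-2411 in the paper) for 5-point scoring, a step omitted here.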