Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills are effective in real-world scenarios, particularly in expert domains, remains largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning capabilities of LLMs in the context of understanding and analyzing specialized documents containing both text and tables. We evaluate a wide spectrum of 48 LLMs with Chain-of-Thought and Program-of-Thought prompting methods to comprehensively assess the capabilities and limitations of existing LLMs on DocMath-Eval. We find that even the current best-performing system (i.e., GPT-4o) still lags significantly behind human experts in solving complex numerical reasoning problems grounded in long contexts. We believe that DocMath-Eval can serve as a valuable benchmark for evaluating LLMs' capabilities in solving challenging numerical reasoning problems within expert domains.
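To illustrate the Program-of-Thought setup mentioned above, the following is a minimal sketch of how a document-grounded numerical question might be answered by executing a model-generated program rather than parsing free-text reasoning. The prompt template, the `call_llm` stub, and the example document are hypothetical placeholders for illustration only; they are not the DocMath-Eval harness itself.

```python
# Minimal Program-of-Thought (PoT) sketch: the model writes a Python program
# whose execution, not its prose, produces the numerical answer.

POT_TEMPLATE = """Read the following document (text and tables), then write a
Python program that computes the answer. Store the final result in `ans`.

Document:
{document}

Question: {question}

Python program:
"""

def call_llm(prompt: str) -> str:
    # Placeholder for a real LLM API call; a canned program is returned here
    # so the sketch runs end-to-end without network access.
    return (
        "revenue_2022 = 1200.0\n"
        "revenue_2021 = 1000.0\n"
        "ans = (revenue_2022 - revenue_2021) / revenue_2021 * 100"
    )

def solve_with_pot(document: str, question: str) -> float:
    program = call_llm(POT_TEMPLATE.format(document=document, question=question))
    namespace: dict = {}
    exec(program, namespace)  # execute the generated program to obtain the answer
    return namespace["ans"]

doc = "Revenue (in $M): 2021: 1,000 | 2022: 1,200"
print(solve_with_pot(doc, "What was the year-over-year revenue growth in 2022, in percent?"))
# -> 20.0
```

In contrast, Chain-of-Thought prompting would elicit step-by-step natural-language reasoning and extract the final number from the model's text; PoT delegates the arithmetic to the Python interpreter, which typically reduces calculation errors on multi-step numerical problems.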