Recent LLMs have demonstrated remarkable performance in solving exam-like math word problems. However, the degree to which these numerical reasoning skills carry over to real-world scenarios, particularly in expert domains, remains largely unexplored. This paper introduces DocMath-Eval, a comprehensive benchmark specifically designed to evaluate the numerical reasoning and problem-solving capabilities of LLMs in the context of understanding and analyzing financial documents containing both text and tables. We evaluate a wide spectrum of 19 LLMs, including those specialized in coding and finance. We also incorporate different prompting strategies (i.e., Chain-of-Thought and Program-of-Thought) to comprehensively assess the capabilities and limitations of existing LLMs on DocMath-Eval. We find that although the current best-performing system (i.e., GPT-4) performs well on simple problems, such as calculating the rate of increase in a financial metric within a short document context, it significantly lags behind human experts on more complex problems grounded in longer contexts. We believe DocMath-Eval can serve as a valuable benchmark for evaluating the ability of LLMs to solve challenging numerical reasoning problems in expert domains. We will release the benchmark and code at https://github.com/yale-nlp/DocMath-Eval.
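To make the "simple problem" category concrete, the rate of increase in a financial metric can be written as a short program of the kind a Program-of-Thought prompt asks a model to produce. This is a minimal sketch and is not taken from the benchmark: the function name `rate_of_increase` and the revenue figures are hypothetical, chosen only for illustration.

```python
def rate_of_increase(previous: float, current: float) -> float:
    """Return the relative change from `previous` to `current`, in percent."""
    if previous == 0:
        # The relative change is undefined when the baseline is zero.
        raise ValueError("rate of increase is undefined when the previous value is 0")
    return (current - previous) / previous * 100.0


# Hypothetical values, as if read from a table in a financial report.
revenue_prior_year = 1250.0
revenue_current_year = 1400.0

answer = rate_of_increase(revenue_prior_year, revenue_current_year)
print(f"{answer:.2f}%")  # prints 12.00%
```

In this style the model's job is to pick the right numbers out of the document and write the program; the arithmetic itself is left to the interpreter.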