We introduce KnowledgeMath, a novel benchmark designed to evaluate LLMs' capabilities in applying financial knowledge to solve complex math word problems. Compared to prior work, this study features three core advancements. First, KnowledgeMath includes 1,259 problems that combine textual and tabular content and require college-level knowledge in the finance domain for effective resolution. Second, we provide expert-annotated, detailed solution references in Python program format, ensuring a high-quality benchmark for LLM assessment. Finally, we evaluate a wide spectrum of 14 LLMs with different prompting strategies such as Chain-of-Thoughts and Program-of-Thoughts. The current best-performing system (i.e., GPT-4 with Program-of-Thoughts) achieves only 45.4% accuracy, leaving substantial room for improvement. While knowledge augmentation can improve LLM performance (e.g., from 23.9% to 32.0% for GPT-3.5), the result is still significantly lower than the estimated human expert performance of 94%. We believe that KnowledgeMath can facilitate future research on domain-specific knowledge retrieval and augmentation in the math word problem-solving process. We will release the benchmark and code at https://github.com/yale-nlp/KnowledgeMath.
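To make the "Python program format" of the solution references more concrete, the following is a minimal sketch of what such a solution and its accuracy check might look like. The problem (a bond-pricing question), the function names `solution` and `is_correct`, and the 1% relative tolerance are all illustrative assumptions; none of them are taken from the benchmark or its released code.

```python
import math


def solution() -> float:
    """Hypothetical finance word problem (not from the benchmark):
    price a 3-year bond with face value 1000, a 5% annual coupon,
    and a 4% annual discount rate."""
    face_value = 1000.0
    coupon_rate = 0.05
    discount_rate = 0.04
    years = 3

    # Annual coupon payment
    coupon = face_value * coupon_rate

    # Present value of each coupon, discounted back from year t
    pv_coupons = sum(
        coupon / (1 + discount_rate) ** t for t in range(1, years + 1)
    )

    # Present value of the face value repaid at maturity
    pv_face = face_value / (1 + discount_rate) ** years

    return pv_coupons + pv_face


def is_correct(predicted: float, reference: float, rel_tol: float = 1e-2) -> bool:
    """Assumed scoring rule: a prediction counts as correct when it is
    within a relative tolerance of the reference answer."""
    return math.isclose(predicted, reference, rel_tol=rel_tol)


if __name__ == "__main__":
    answer = solution()
    print(f"Bond price: {answer:.2f}")
    print("Matches reference:", is_correct(answer, 1027.75))
```

Under a Program-of-Thoughts setup, a model would be asked to generate a program of this shape, and the value returned by executing it would be compared with the expert-annotated answer.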