Recent advancements in Large Language Models (LLMs) and Multi-Modal Models (MMs) have demonstrated their remarkable capabilities in problem-solving. Yet, their proficiency in solving geometry math problems, which demands an integrated understanding of both textual and visual information, has not been thoroughly evaluated. To address this gap, we introduce the GeoEval benchmark, a comprehensive collection comprising a main subset of 2000 problems, a 750-problem subset focused on backward reasoning, an augmented subset of 2000 problems, and a hard subset of 300 problems. This benchmark facilitates a deeper investigation into the performance of LLMs and MMs on geometry math problems. Our evaluation of ten LLMs and MMs across these varied subsets reveals that the WizardMath model excels, achieving a 55.67\% accuracy rate on the main subset but only 6.00\% accuracy on the hard subset. This highlights the critical need to test models on datasets on which they have not been pre-trained. Additionally, our findings indicate that GPT-series models perform more effectively on problems they have rephrased themselves, suggesting a promising method for enhancing model capabilities.