The current evaluation of mathematical skills in LLMs is limited, as existing benchmarks are either relatively small, focus primarily on elementary and high-school problems, or lack diversity in topics. Additionally, the inclusion of visual elements in tasks remains largely under-explored. To address these gaps, we introduce U-MATH, a novel benchmark of 1,100 unpublished open-ended university-level problems sourced from teaching materials. It is balanced across six core subjects, with 20% of the problems being multimodal. Given the open-ended nature of U-MATH problems, we employ an LLM to judge the correctness of generated solutions. To this end, we release $\mu$-MATH, a dataset to evaluate the LLMs' capabilities in judging solutions. The evaluation of general-domain, math-specific, and multimodal LLMs highlights the challenges presented by U-MATH. Our findings reveal that LLMs achieve a maximum accuracy of only 63% on text-based tasks, and an even lower 45% on visual problems. Solution assessment also proves challenging for LLMs, with the best LLM judge attaining an F1-score of 80% on $\mu$-MATH.