With the continuous advancement of large language models (LLMs) in mathematical reasoning, evaluating their performance in this domain has become a prominent research focus. Recent studies have raised concerns about the reliability of current mathematical benchmarks, highlighting issues such as simplistic design and potential data leakage. Creating a reliable benchmark that effectively evaluates the genuine mathematical reasoning capabilities of LLMs therefore remains a significant challenge. To address this, we propose RV-Bench, a framework for Benchmarking LLMs via Random Variables in mathematical reasoning. Specifically, the background content of a random variable question (RV question) mirrors an original problem from an existing standard benchmark, but the variable values are randomized into different combinations. An LLM must fully understand the problem-solving process of the original problem to correctly answer RV questions across these varying combinations of variable values. As a result, an LLM's genuine mathematical reasoning capability is reflected by its accuracy on RV-Bench. We conduct extensive experiments with 29 representative LLMs across 900+ RV questions and provide a leaderboard ranking their genuine capabilities. Further analysis of the accuracy drops indicates that current LLMs still struggle with complex mathematical reasoning problems.
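The core mechanism behind an RV question can be sketched as follows: the problem background is kept as a template, a fresh combination of variable values is sampled, and the ground-truth answer is recomputed for each combination. This is a minimal illustrative sketch assuming a simple templated word problem; the template, variable names, and value ranges are hypothetical and not taken from the paper.

```python
import random

# Hypothetical template: the "background content" stays fixed, while the
# numeric variables {d} and {v} are randomized into different combinations.
TEMPLATE = ("A train travels {d} km at a constant speed of {v} km/h. "
            "How many hours does the trip take?")

def make_rv_question(rng: random.Random):
    """Sample one RV question and recompute its ground-truth answer."""
    v = rng.choice([40, 50, 60, 80])   # speed in km/h
    d = v * rng.randint(2, 6)          # distance chosen so the answer is an integer
    question = TEMPLATE.format(d=d, v=v)
    answer = d // v                    # ground truth derived per sampled combination
    return question, answer

rng = random.Random(0)
question, answer = make_rv_question(rng)
```

A model that has merely memorized one fixed instance of the problem will fail once the values change, whereas a model that understands the underlying solving process answers correctly for any sampled combination.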