Large language models (LLMs) have shown increasing capability in problem-solving and decision-making, largely driven by step-by-step chain-of-thought reasoning. However, evaluating these reasoning abilities has become increasingly challenging. Existing outcome-based benchmarks are beginning to saturate, becoming less effective at tracking meaningful progress. To address this, we present MR-Ben, a process-based benchmark that demands meta-reasoning skills: LLMs are asked to locate and analyze potential errors in automatically generated reasoning steps. Our meta-reasoning paradigm is especially suited to system-2 slow thinking, mirroring the human cognitive process of carefully examining assumptions, conditions, calculations, and logic to identify mistakes. MR-Ben comprises 5,975 questions curated by human experts across a wide range of subjects, including physics, chemistry, logic, coding, and more. Using the metrics we designed for assessing meta-reasoning on this benchmark, we identify notable limitations and weaknesses of current LLMs, both open-source and closed-source. For example, while models like OpenAI's o1 series demonstrate strong performance by effectively scrutinizing the solution space, many other state-of-the-art models fall significantly behind on MR-Ben, exposing potential shortcomings in their training strategies and inference methodologies.
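To make the evaluation paradigm concrete, below is a minimal, hypothetical sketch of how a meta-reasoning item might be scored: the model judges whether an automatically generated solution is correct and, if not, locates the first erroneous step. This is not the paper's actual metric code; all field and function names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Annotation:
    """Expert label for one generated solution (hypothetical schema)."""
    solution_correct: bool            # is the generated solution correct overall?
    first_error_step: Optional[int]   # index of the first erroneous step; None if correct


@dataclass
class Prediction:
    """Model's meta-reasoning judgment for the same solution."""
    solution_correct: bool
    first_error_step: Optional[int]


def score(preds: List[Prediction], golds: List[Annotation]) -> tuple:
    """Return (correctness-judgment accuracy, error-step localization accuracy)."""
    assert len(preds) == len(golds)
    # Accuracy of judging whether the solution is correct at all.
    judge_hits = sum(p.solution_correct == g.solution_correct
                     for p, g in zip(preds, golds))
    # Localization is only scored on solutions annotated as erroneous.
    erroneous = [(p, g) for p, g in zip(preds, golds) if not g.solution_correct]
    loc_hits = sum(p.first_error_step == g.first_error_step
                   for p, g in erroneous)
    return judge_hits / len(golds), loc_hits / max(len(erroneous), 1)


# Usage example with two annotated items: one correct solution, one with an
# error at step 2 that the model mislocates at step 3.
golds = [Annotation(True, None), Annotation(False, 2)]
preds = [Prediction(True, None), Prediction(False, 3)]
print(score(preds, golds))  # (1.0, 0.0)
```

Conditioning the localization score on solutions annotated as erroneous (rather than all items) is one plausible design choice; it prevents a model from inflating its localization accuracy by simply declaring every solution correct.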