An impediment to using Large Language Models (LLMs) for reasoning-output verification is that LLMs struggle to reliably identify errors in thinking traces, particularly in long outputs, in domains requiring expert knowledge, and on problems without verifiable rewards. We propose a data-driven approach that automatically constructs highly granular reasoning-error taxonomies to enhance LLM-driven error detection on unseen reasoning traces. Our findings indicate that classification approaches leveraging these error taxonomies, or "rubrics", identify errors more reliably than baseline methods in technical domains such as coding, math, and chemical engineering. These rubrics can be used to build stronger LLM-as-judge reward functions for training reasoning models via reinforcement learning. Experimental results show that these rewards can improve models' task accuracy on difficult domains by up to 45% over models trained with general LLM-as-judge rewards, and approach the performance of models trained with verifiable rewards while using as few as 20% of the gold labels. Through our approach, we extend the use of reward rubrics from assessing qualitative model behavior to assessing quantitative model correctness on tasks typically learned via RLVR rewards. This extension opens the door to teaching models to solve complex technical problems without a full dataset of gold labels, which are often highly costly to procure.
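To make the rubric-as-reward idea concrete, below is a minimal sketch of how a rubric-based LLM-as-judge reward might be wired up. The `RubricItem` schema, the `llm` client with its `complete(prompt) -> str` method, and the judge prompt are all illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch of a rubric-based LLM-as-judge reward.
# Assumptions (not from the paper): a hypothetical `llm` client exposing
# `complete(prompt) -> str`, and an illustrative rubric schema and prompt.

from dataclasses import dataclass

@dataclass
class RubricItem:
    name: str         # short error-category label, e.g. "unit_mismatch"
    description: str  # granular description of the reasoning error

JUDGE_PROMPT = """You are verifying a reasoning trace.
Error category: {name} -- {description}
Trace:
{trace}
Does the trace contain this error? Answer YES or NO."""

def rubric_reward(llm, rubric: list[RubricItem], trace: str) -> float:
    """Score a reasoning trace as the fraction of rubric error
    categories the judge finds absent (1.0 = no detected errors)."""
    flags = []
    for item in rubric:
        prompt = JUDGE_PROMPT.format(
            name=item.name, description=item.description, trace=trace
        )
        answer = llm.complete(prompt).strip().upper()
        flags.append(answer.startswith("YES"))  # YES = error detected
    return 1.0 - sum(flags) / len(flags)
```

In a reinforcement-learning loop, `rubric_reward` would stand in for a verifiable-reward function: each sampled trace is scored against every rubric item, so the scalar reward reflects quantitative correctness rather than a single holistic judge verdict.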