Mathematical verifiers have recently achieved success in mathematical reasoning tasks by validating the correctness of solutions generated by policy models. However, existing verifiers are trained with binary classification labels, which are not informative enough for the model to assess solutions accurately. To mitigate this insufficiency, we introduce step-wise natural language feedback as rationale labels, that is, the correctness of each step together with a detailed explanation. In this paper, we propose Math-Minos, a natural language feedback-enhanced verifier built on automatically generated training data and a two-stage training paradigm for effective training and efficient inference. Our experiments reveal that a small set of natural language feedback can significantly boost the performance of the verifier in both verification and reinforcement learning. We have released the code and data for further exploration.
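To make the contrast between a binary label and a step-wise rationale label concrete, the following is a minimal illustrative sketch. It is not the Math-Minos data format: every class and field name is hypothetical, and the rule that a solution counts as correct only when all of its steps are correct is an assumption made here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class BinaryLabel:
    """Outcome-only supervision: a single bit for the whole solution."""
    solution_is_correct: bool


@dataclass
class StepFeedback:
    """Hypothetical step-wise feedback: a verdict plus an explanation."""
    step: str
    is_correct: bool
    explanation: str


@dataclass
class RationaleLabel:
    """Hypothetical rationale label: one StepFeedback per solution step."""
    steps: List[StepFeedback] = field(default_factory=list)

    def to_binary(self) -> BinaryLabel:
        # Assumption: a non-empty solution is correct only if every step is.
        ok = bool(self.steps) and all(s.is_correct for s in self.steps)
        return BinaryLabel(solution_is_correct=ok)

    def first_error_index(self) -> Optional[int]:
        # Index of the first incorrect step, or None if no step is wrong.
        for i, s in enumerate(self.steps):
            if not s.is_correct:
                return i
        return None


# A made-up two-step solution with an arithmetic slip in the second step.
label = RationaleLabel(steps=[
    StepFeedback("2 + 3 = 5", True, "The addition is correct."),
    StepFeedback("5 * 4 = 25", False, "5 * 4 equals 20, not 25."),
])
binary = label.to_binary()
```

In this sketch the binary label only records that the solution is wrong, while the rationale label also says which step failed and why, which is the extra information the abstract argues binary labels lack.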