Reinforcement Learning with Verifiable Rewards (RLVR) has demonstrated great potential in enhancing the reasoning capabilities of large language models (LLMs). However, its success has thus far been largely confined to the mathematical and programming domains, where outcomes are clear and automatically checkable. Reinforcement learning on open-ended tasks (e.g., creative writing and subjective Q&A) continues to rely on reward models because verifiable solutions are unavailable. This raises a key question: how can we extend RLVR to strengthen reasoning on open-ended tasks despite the absence of unambiguous ground truth? To overcome this challenge, we introduce Verifiable Multiple-Choice Reformulation for Reinforcement Learning with Verifiable Rewards (VMR-RLVR), a novel training strategy that restructures open-ended data into a verifiable multiple-choice format, enabling effective training even without explicit ground truth. Experimental results on multiple benchmarks validate the effectiveness of our method in improving LLM performance on open-ended tasks. Notably, across seven open-ended benchmarks, VMR-RLVR training delivers an average gain of 3.29 points over RL with a reward model.
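To make the reformulation concrete, the following is a minimal sketch of how an open-ended example could be cast into a verifiable multiple-choice item with a binary reward. The option-sourcing scheme (a known reference response plus distractor responses) and all function names are illustrative assumptions, not the paper's actual construction.

```python
import random
import string


def reformulate_to_multiple_choice(prompt, reference_response, distractor_responses, seed=None):
    """Turn an open-ended example into a verifiable multiple-choice item.

    Assumption (not specified in the abstract): `reference_response` is the
    preferred answer and `distractor_responses` are weaker or off-prompt
    candidates. Options are shuffled so the correct letter varies.
    """
    rng = random.Random(seed)
    options = [(True, reference_response)] + [(False, d) for d in distractor_responses]
    rng.shuffle(options)

    letters = string.ascii_uppercase[: len(options)]
    correct_letter = next(l for l, (is_ref, _) in zip(letters, options) if is_ref)
    option_block = "\n".join(f"{l}. {text}" for l, (_, text) in zip(letters, options))

    mc_prompt = (
        f"{prompt}\n\n"
        f"Which of the following responses best answers the request?\n"
        f"{option_block}\n"
        f"Answer with a single letter."
    )
    return mc_prompt, correct_letter


def verifiable_reward(model_answer, correct_letter):
    """Binary, automatically checkable reward: 1 if the chosen letter matches."""
    choice = model_answer.strip().upper()[:1]
    return 1.0 if choice == correct_letter else 0.0
```

Under this sketch, the RLVR reward needs no reward model: the policy's rollout is scored purely by whether its selected option matches the held-out correct letter.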