From Verifiable Dot to Reward Chain: Harnessing Verifiable Reference-based Rewards for Reinforcement Learning of Open-ended Generation

Reinforcement learning with verifiable rewards (RLVR) succeeds in reasoning tasks (e.g., math and code) by checking the final verifiable answer (i.e., a verifiable dot signal). However, extending this paradigm to open-ended generation is challenging because there is no unambiguous ground truth. Relying on single-dot supervision often leads to inefficiency and reward hacking. To address these issues, we propose reinforcement learning with verifiable reference-based rewards (RLVRR). Instead of checking the final answer, RLVRR extracts an ordered linguistic signal from high-quality references (i.e., a reward chain). Specifically, RLVRR decomposes rewards into two dimensions: content, which preserves deterministic core concepts (e.g., keywords), and style, which evaluates adherence to stylistic properties through LLM-based verification. In this way, RLVRR combines the exploratory strength of RL with the efficiency and reliability of supervised fine-tuning (SFT). Extensive experiments on more than 10 benchmarks with Qwen and Llama models confirm the advantages of our approach. RLVRR (1) substantially outperforms SFT trained with ten times more data and advanced reward models, (2) unifies the training of structured reasoning and open-ended generation, and (3) generalizes more effectively while preserving output diversity. These results establish RLVRR as a principled and efficient path toward verifiable reinforcement learning for general-purpose LLM alignment. We release our code and data at https://github.com/YJiangcm/RLVRR.
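
To make the content/style decomposition concrete, the following is a minimal sketch of how a reference-based reward of this kind might be computed. It is not the released implementation: the function names (`content_reward`, `style_reward`, `chain_reward`), the greedy in-order keyword scan, the equal default weighting, and the use of plain Python predicates in place of LLM-based style verification are all assumptions made for illustration.

```python
from typing import Callable, Sequence


def content_reward(response: str, keywords: Sequence[str]) -> float:
    """Fraction of reference keywords matched in reference order.

    Hypothetical scoring rule: a greedy left-to-right scan, so a keyword
    only counts if it appears after the previously matched keyword.
    """
    if not keywords:
        return 0.0
    text = response.lower()
    pos = 0
    hits = 0
    for kw in keywords:
        idx = text.find(kw.lower(), pos)
        if idx != -1:
            hits += 1
            pos = idx + len(kw)
    return hits / len(keywords)


def style_reward(response: str, checks: Sequence[Callable[[str], bool]]) -> float:
    """Fraction of style checks the response passes.

    Each check stands in for an LLM-based verifier (an assumption); here
    any callable mapping the response to True/False will do.
    """
    if not checks:
        return 0.0
    return sum(1 for check in checks if check(response)) / len(checks)


def chain_reward(
    response: str,
    keywords: Sequence[str],
    checks: Sequence[Callable[[str], bool]],
    content_weight: float = 0.5,  # assumed weighting, not taken from the paper
) -> float:
    """Weighted sum of the content and style rewards."""
    content = content_reward(response, keywords)
    style = style_reward(response, checks)
    return content_weight * content + (1.0 - content_weight) * style


if __name__ == "__main__":
    response = "Photosynthesis converts sunlight into chemical energy."
    keywords = ["photosynthesis", "sunlight", "energy"]
    checks = [
        lambda r: len(r.split()) <= 20,  # concise
        lambda r: r.endswith("."),       # ends with a full stop
    ]
    print(chain_reward(response, keywords, checks))
```

In this sketch the keyword list and the style checks would both be derived from a high-quality reference answer, so the reward stays tied to verifiable properties of that reference rather than to a single final answer.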
