Reinforcement learning for open-ended text generation is constrained by the lack of verifiable rewards, which forces reliance on judge models that in turn require either annotated data or powerful closed-source models. Inspired by recent work on unsupervised reinforcement learning for mathematical reasoning with confidence-based endogenous rewards, we investigate whether this principle can be adapted to open-ended writing tasks. We find that directly applying confidence rewards leads to Triviality Bias: the policy collapses toward high-probability outputs, which reduces both diversity and meaningful content. We propose TCER (Triviality Corrected Endogenous Reward), which addresses this bias by rewarding the relative information gain between a specialist policy and a generalist reference policy, modulated by a probability-dependent correction mechanism. Across multiple writing benchmarks and model architectures, TCER achieves consistent improvements without external supervision. TCER also transfers effectively to mathematical reasoning, which supports the generality of the approach across generation tasks.