The success of RL for LLM post-training stems from an unreasonably uninformative source: a single bit of information per rollout, in the form of a binary reward or preference label. At the other extreme, distillation offers dense supervision but requires demonstrations, which are costly and difficult to scale. We study text feedback as an intermediate signal: richer than scalar rewards, yet cheaper than complete demonstrations. Textual feedback is a natural mode of human interaction and is already abundant in many real-world settings, where users, annotators, and automated judges routinely critique LLM outputs. Toward leveraging text feedback at scale, we formalize a multi-turn RL setup, RL from Text Feedback (RLTF), in which text feedback is available during training but not at inference. Models must therefore learn to internalize the feedback in order to improve their single-turn performance at test time. To this end, we propose two methods: Self Distillation (RLTF-SD), which trains the single-turn policy to match its own feedback-conditioned second-turn generations; and Feedback Modeling (RLTF-FM), which predicts the feedback as an auxiliary objective. We provide theoretical analyses of both methods and evaluate them empirically on reasoning puzzles, competition math, and creative writing tasks. Our results show that both methods consistently outperform strong baselines across benchmarks, highlighting the potential of RL with an additional source of rich supervision at scale.