Text2Grad: Reinforcement Learning from Natural Language Feedback

Traditional RLHF optimizes language models with coarse, scalar rewards that mask the fine-grained reasons behind success or failure, leading to slow and opaque learning. Recent work augments RL with textual critiques through prompting or reflection, improving interpretability but leaving model parameters untouched. We introduce Text2Grad, a reinforcement-learning paradigm that turns free-form textual feedback into span-level gradients. Given human (or programmatic) critiques, Text2Grad aligns each feedback phrase with the relevant token spans, converts these alignments into differentiable reward signals, and performs gradient updates that directly refine the offending portions of the model's policy. This yields precise, feedback-conditioned adjustments instead of global nudges. Text2Grad is realized through three components: (1) a high-quality feedback-annotation pipeline that pairs critiques with token spans; (2) a fine-grained reward model that predicts span-level rewards on answers while generating explanatory critiques; and (3) a span-level policy optimizer that back-propagates natural-language gradients. Across summarization, code generation, and question answering, Text2Grad consistently surpasses scalar-reward RL and prompt-only baselines, providing both higher task metrics and richer interpretability. Our results suggest that natural-language feedback can serve not only as explanations, but also as actionable training signals for fine-grained alignment. The code for our method is available at https://github.com/microsoft/Text2Grad.
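To make the span-to-gradient idea concrete, the following is a minimal sketch of how span-level feedback could be turned into per-token rewards and then into a token-weighted policy-gradient loss. It is an illustration under stated assumptions, not the Text2Grad implementation: the `Span` format, the +1/-1 scores, the function names, and the plain REINFORCE-style objective are all hypothetical choices made here, and the released code may differ in every one of them.

```python
from typing import List, Tuple

# Hypothetical span annotation: (start_token, end_token_exclusive, score).
# Assumed convention: +1.0 for a praised span, -1.0 for a criticized span.
Span = Tuple[int, int, float]


def spans_to_token_rewards(num_tokens: int, spans: List[Span]) -> List[float]:
    """Spread span-level scores onto tokens; unannotated tokens get 0.0."""
    rewards = [0.0] * num_tokens
    for start, end, score in spans:
        if not (0 <= start < end <= num_tokens):
            raise ValueError(f"span ({start}, {end}) is out of range")
        for t in range(start, end):
            rewards[t] += score
    return rewards


def token_level_pg_loss(log_probs: List[float], rewards: List[float]) -> float:
    """REINFORCE-style surrogate loss: -(1/T) * sum_t r_t * log_prob_t."""
    if not log_probs:
        raise ValueError("log_probs must not be empty")
    if len(log_probs) != len(rewards):
        raise ValueError("log_probs and rewards must have the same length")
    return -sum(r * lp for r, lp in zip(rewards, log_probs)) / len(log_probs)


# Toy response: a critique praised tokens 1..3 and flagged token 5.
tokens = ["The", "report", "covers", "sales", "in", "2031"]
spans = [(1, 4, 1.0), (5, 6, -1.0)]
rewards = spans_to_token_rewards(len(tokens), spans)

# Placeholder log-probabilities; a real policy would supply these.
log_probs = [-0.5] * len(tokens)
loss = token_level_pg_loss(log_probs, rewards)
```

In this sketch, minimizing the loss raises the log-probability of tokens in praised spans and lowers it for tokens in criticized spans, while unannotated tokens contribute nothing. A single scalar reward would instead weight every token in the response equally.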
