From Reasoning to Agentic: Credit Assignment in Reinforcement Learning for Large Language Models

Reinforcement learning (RL) for large language models (LLMs) increasingly relies on sparse, outcome-level rewards, yet determining which actions within a long trajectory caused the outcome remains difficult. This credit assignment (CA) problem manifests in two regimes: reasoning RL, where credit must be distributed across tokens and steps within a single chain-of-thought generation (500 to 30K+ tokens); and agentic RL, where multi-turn environment interaction introduces stochastic transitions, partial observability, and horizons of 100+ turns (100K to 1M tokens), making episode-level credit increasingly uninformative. We survey 47 CA methods (41 core, 6 adjacent enablers) published between 2024 and early 2026, organizing them in a two-dimensional taxonomy by assignment granularity (token, segment, step, turn, multi-agent) and methodology (Monte Carlo, temporal difference, model-based, game-theoretic, information-theoretic). Beyond the survey itself, we contribute three reusable resources: (1) a structured, machine-readable paper inventory with taxonomy labels, baseline families, and evidence levels; (2) a reporting checklist for future CA papers, validated against the reviewed literature to identify systematic methodological gaps; and (3) a benchmark protocol specification with task families, metadata requirements, and controlled bifurcation tasks, accompanied by a method selection decision tree. Our synthesis suggests that the shift from reasoning to agentic RL complicates and reshapes the credit assignment landscape: reasoning CA is maturing around process reward models and critic-free group comparison, while agentic CA is driving genuinely new approaches (hindsight counterfactual analysis, privileged asymmetric critics, and turn-level MDP reformulations) that have no direct precedent in reasoning RL.
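To make the episode-level baseline concrete, the following is a minimal Python sketch of critic-free group comparison in its simplest form: outcome rewards are normalized within a group of sampled trajectories, and the resulting scalar is copied to every token. The function names, the toy rewards, and the trajectory lengths are illustrative assumptions, not taken from any specific surveyed method.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize outcome rewards within one group of sampled trajectories.

    Hypothetical helper: each trajectory gets a single scalar advantage,
    computed relative to the group mean and (population) standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_tokens(advantages, lengths):
    """Copy each trajectory-level advantage to all of that trajectory's tokens."""
    return [[a] * n for a, n in zip(advantages, lengths)]


# Toy group of four trajectories with binary outcome rewards (assumed values).
rewards = [1.0, 0.0, 0.0, 1.0]
lengths = [3, 5, 2, 4]  # token counts per trajectory (assumed values)

adv = group_relative_advantages(rewards)
token_credit = broadcast_to_tokens(adv, lengths)

for r, credit in zip(rewards, token_credit):
    print(f"reward={r}: per-token credit={[round(c, 3) for c in credit]}")
```

Every token in a trajectory receives the same credit regardless of whether it contributed to the outcome. This uniformity is the limitation that finer-grained assignment (token, segment, step, or turn level) aims to address, and it becomes more pronounced as trajectories grow toward agentic horizons.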
