Reinforcement learning from verifiable rewards (RLVR) is a promising paradigm for improving large language model (LLM) agents on long-horizon interactive tasks. However, in partially observable environments, incomplete observations cause agent beliefs to drift over time, while delayed rewards obscure the causal impact of intermediate decisions, exacerbating temporal credit assignment challenges. To address these challenges, we propose ReBel (Reward Belief), a process-level reinforcement learning algorithm that explicitly models structured belief states to summarize interaction history and guide subsequent policy learning. ReBel introduces belief-consistency supervision, which converts discrepancies between predicted beliefs and observed feedback into dense self-supervised signals without requiring external step-wise annotations or verifiers. It also employs belief-aware grouping to compare trajectories under similar belief states, yielding more robust and lower-variance advantage estimates. We evaluate ReBel on challenging long-horizon benchmarks, including ALFWorld and WebShop. ReBel improves task success by up to $20.4$ percentage points over the episode-level baseline GRPO and improves sample efficiency by $2.1\times$. These results suggest that belief-aware self-supervision is a promising direction for reliable long-horizon decision-making under partial observability. Code is available at: https://github.com/Fateyetian/Rebel.git.
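As a rough illustration of the belief-aware grouping idea described in the abstract, the sketch below normalizes rewards within groups of samples that share a belief key, in the style of a group-relative (GRPO-like) advantage. It is not the authors' implementation: the function name `grouped_advantages`, the use of a discrete `belief_key` to stand in for "similar belief states", and the `eps` stabilizer are all assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grouped_advantages(samples, eps=1e-8):
    """Compute group-normalized advantages.

    samples: list of (belief_key, reward) pairs. belief_key is a hypothetical
    hashable summary of the belief state (an assumption of this sketch).
    Returns one advantage per sample, in input order.
    """
    # Collect the rewards of all samples that share a belief key.
    groups = defaultdict(list)
    for key, reward in samples:
        groups[key].append(reward)

    # Per-group mean and population standard deviation.
    stats = {key: (mean(rs), pstdev(rs)) for key, rs in groups.items()}

    # Normalize each reward against its own group's statistics.
    return [
        (reward - stats[key][0]) / (stats[key][1] + eps)
        for key, reward in samples
    ]


# Hypothetical example: two belief groups with two samples each.
example = [("door_closed", 1.0), ("door_closed", 0.0),
           ("door_open", 1.0), ("door_open", 1.0)]
advantages = grouped_advantages(example)
```

In this toy input, the two `door_closed` samples receive advantages of opposite sign, while the `door_open` samples, which have identical rewards, receive zero advantage. How ReBel actually defines belief similarity and forms groups is specified in the paper and the linked repository, not here.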