Large language model (LLM) agents learn by interacting with environments, but long-horizon training remains fundamentally bottlenecked by sparse and delayed rewards. Existing methods typically address this challenge through post-hoc credit assignment or external reward models, which provide limited guidance at inference time and often decouple reward improvement from policy improvement. We propose Self-Guide, a self-generated internal reward for language agents that supports both inference-time guidance and training-time supervision. Specifically, the agent uses Self-Guide as a short self-guidance signal to steer its next action during inference, and converts the same signal into a step-level internal reward for denser policy optimization during training. This creates a co-evolving loop: a better policy produces better guidance, and better guidance, used as internal reward, further improves the policy. Across three agent benchmarks, inference-time self-guidance alone already yields clear gains, while jointly evolving the policy and the internal reward with GRPO brings a further improvement (8%) over baselines trained solely with environment reward. Overall, our results suggest that language agents can improve not only by collecting more experience, but also by learning to generate and refine their own internal reward while acting and learning.
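To make the training-time half of this loop concrete, the following is a minimal sketch of how step-level internal rewards might be combined with a sparse environment reward and turned into GRPO-style group-relative advantages. The abstract does not specify how the self-guidance signal is scored or weighted, so the per-step scores, the mixing coefficient `weight`, and the helper names `step_rewards` and `group_relative_advantages` are illustrative assumptions, not the authors' implementation. The population standard deviation is used for normalization here; actual GRPO implementations may differ on this detail.

```python
from statistics import mean, pstdev


def step_rewards(internal, env_reward, weight=0.5):
    """Blend per-step internal rewards with a sparse terminal environment reward.

    `internal` holds one hypothetical self-guidance score per step.
    `weight` is an assumed mixing coefficient, not a value from the paper.
    """
    rewards = [weight * r for r in internal]
    if rewards:
        # The environment reward is sparse: it arrives only at the final step.
        rewards[-1] += env_reward
    return rewards


def group_relative_advantages(returns, eps=1e-8):
    """GRPO-style advantages: standardize each return within its rollout group."""
    mu = mean(returns)
    sigma = pstdev(returns)
    return [(g - mu) / (sigma + eps) for g in returns]


# Three rollouts of the same task (made-up numbers, for illustration only).
group_internal = [[0.2, 0.8, 0.6], [0.1, 0.1, 0.3], [0.5, 0.4, 0.9]]
group_env = [1.0, 0.0, 1.0]

# Dense per-step rewards, then one scalar return per rollout.
dense = [step_rewards(i, e) for i, e in zip(group_internal, group_env)]
returns = [sum(r) for r in dense]
advantages = group_relative_advantages(returns)
```

In this toy group, two rollouts receive the same environment reward, yet their advantages differ because the internal reward separates them. That is the sense in which a step-level signal can make optimization denser than the environment reward alone.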