A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its own suboptimal actions to improve reasoning and decision-making. We evaluate both strategies across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
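To make the paradigm concrete, the sketch below illustrates how early-experience data could be collected and turned into the two kinds of reward-free supervision described above. This is an illustrative reconstruction under assumed interfaces, not the paper's implementation: `env.step_from`, `agent.propose_actions`, and `expert_action_for` are hypothetical placeholders.

```python
# A minimal sketch (not the paper's implementation) of the early-experience
# pipeline: the agent branches off expert-visited states with its own
# alternative actions, and the environment's resulting states provide
# reward-free supervision. All interfaces below are hypothetical.
from dataclasses import dataclass


@dataclass
class Transition:
    state: str       # observation before acting
    action: str      # an action the agent itself proposed
    next_state: str  # observation the environment returned (no reward read)


def collect_early_experience(env, agent, expert_states, k_alternatives=3):
    """Roll out the agent's own alternative actions from expert-visited states."""
    transitions = []
    for state in expert_states:
        for action in agent.propose_actions(state, n=k_alternatives):
            next_state = env.step_from(state, action)  # hypothetical API
            transitions.append(Transition(state, action, next_state))
    return transitions


def world_modeling_examples(transitions):
    """Strategy 1 (implicit world modeling): train the policy model to
    predict the future state that follows a (state, action) pair."""
    return [
        (f"State: {t.state}\nAction: {t.action}\nNext state:", t.next_state)
        for t in transitions
    ]


def self_reflection_examples(transitions, expert_action_for):
    """Strategy 2 (self-reflection): contrast the agent's own action and its
    observed outcome with the expert action, and train on explaining why the
    expert action is preferable (the explanation itself would be
    model-generated in practice)."""
    examples = []
    for t in transitions:
        expert = expert_action_for(t.state)
        prompt = (
            f"State: {t.state}\n"
            f"You tried: {t.action} -> observed: {t.next_state}\n"
            f"Explain why '{expert}' is the better action, then act."
        )
        examples.append((prompt, expert))
    return examples
```

Both functions emit ordinary (prompt, target) pairs, so either strategy reduces to standard supervised fine-tuning on agent-generated data, which is what lets early experience sit between imitation learning and reward-driven reinforcement learning.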