A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies for using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluations across eight diverse environments and multiple model families show that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.
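To make the data flow behind early experience concrete, here is a minimal sketch rather than the authors' implementation. It assumes expert demonstrations are available as (state, action) pairs and that the environment can be stepped from a given state. The agent tries alternative actions, records the resulting next states, and turns them into supervised examples with no reward involved. All names (`Transition`, `collect_early_experience`, `world_modeling_examples`, `reflection_examples`, and the toy environment) are hypothetical. The prompt formats, and the choice to pair a generated rationale with the expert action for self-reflection, are assumptions not specified in the abstract.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Transition:
    state: str
    action: str
    next_state: str


def collect_early_experience(expert_pairs, propose_actions, step, k=2):
    """Try up to k non-expert actions per expert state and record the outcomes."""
    transitions: List[Transition] = []
    for state, expert_action in expert_pairs:
        alternatives = [a for a in propose_actions(state) if a != expert_action][:k]
        for action in alternatives:
            transitions.append(Transition(state, action, step(state, action)))
    return transitions


def world_modeling_examples(transitions) -> List[Tuple[str, str]]:
    """Implicit world modeling: predict the next state from (state, action)."""
    return [
        (f"State: {t.state}\nAction: {t.action}\nNext state:", t.next_state)
        for t in transitions
    ]


def reflection_examples(expert_pairs, transitions, step, explain):
    """Self-reflection (assumed format): a rationale contrasting outcomes, then the expert action."""
    examples = []
    for state, expert_action in expert_pairs:
        tried = [t for t in transitions if t.state == state]
        if not tried:
            continue
        rationale = explain(state, expert_action, step(state, expert_action), tried)
        examples.append((f"State: {state}\nThink, then act:",
                         f"{rationale}\nAction: {expert_action}"))
    return examples


# Toy deterministic "website" used only for illustration.
TOY_DYNAMICS = {
    ("home", "search"): "results",
    ("home", "open_cart"): "empty_cart",
    ("home", "login"): "login_form",
    ("results", "open_item"): "item_page",
    ("results", "back"): "home",
}


def toy_step(state: str, action: str) -> str:
    return TOY_DYNAMICS.get((state, action), "error_page")


def toy_propose(state: str) -> List[str]:
    # Stand-in for sampling actions from the agent's own policy.
    return sorted(a for (s, a) in TOY_DYNAMICS if s == state)


def toy_explain(state, expert_action, expert_next, tried) -> str:
    # Stand-in for a rationale that a language model would generate.
    seen = "; ".join(f"{t.action} led to {t.next_state}" for t in tried)
    return f"From {state}: {seen}. {expert_action} leads to {expert_next}, so prefer it."


if __name__ == "__main__":
    pairs = [("home", "search"), ("results", "open_item")]
    rollouts = collect_early_experience(pairs, toy_propose, toy_step, k=2)
    for prompt, target in world_modeling_examples(rollouts):
        print(prompt, target)
    for prompt, target in reflection_examples(pairs, rollouts, toy_step, toy_explain):
        print(prompt, target)
```

In a real setting, `toy_propose` and `toy_explain` would be replaced by samples from the language agent itself, and the resulting (prompt, target) pairs would feed an ordinary supervised fine-tuning loop, which is what lets the approach work in environments without verifiable rewards.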