A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans in complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is challenging to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm we study two strategies for using such data: (1) Implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) Self-reflection, where the agent learns from its own suboptimal actions to improve reasoning and decision-making. We evaluate both strategies across eight diverse environments and multiple model families. Our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, positioning it as a practical bridge between imitation learning and fully experience-driven agents.
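To make the paradigm concrete, the sketch below illustrates how early-experience data could be collected and turned into the two kinds of reward-free supervision described above. This is an illustrative reconstruction under assumed interfaces, not the paper's implementation: `env.step_from`, `agent.propose_actions`, and `expert_action_for` are hypothetical placeholders.

```python
# A minimal sketch (not the paper's implementation) of the early-experience
# pipeline: the agent branches off expert-visited states with its own
# alternative actions, and the environment's resulting states provide
# reward-free supervision. All interfaces below are hypothetical.
from dataclasses import dataclass


@dataclass
class Transition:
    state: str       # observation before acting
    action: str      # an action the agent itself proposed
    next_state: str  # observation the environment returned (no reward read)


def collect_early_experience(env, agent, expert_states, k_alternatives=3):
    """Roll out the agent's own alternative actions from expert-visited states."""
    transitions = []
    for state in expert_states:
        for action in agent.propose_actions(state, n=k_alternatives):
            next_state = env.step_from(state, action)  # hypothetical API
            transitions.append(Transition(state, action, next_state))
    return transitions


def world_modeling_examples(transitions):
    """Strategy 1 (implicit world modeling): train the policy model to
    predict the future state that follows a (state, action) pair."""
    return [
        (f"State: {t.state}\nAction: {t.action}\nNext state:", t.next_state)
        for t in transitions
    ]


def self_reflection_examples(transitions, expert_action_for):
    """Strategy 2 (self-reflection): contrast the agent's own action and its
    observed outcome with the expert action, and train on explaining why the
    expert action is preferable (the explanation itself would be
    model-generated in practice)."""
    examples = []
    for t in transitions:
        expert = expert_action_for(t.state)
        prompt = (
            f"State: {t.state}\n"
            f"You tried: {t.action} -> observed: {t.next_state}\n"
            f"Explain why '{expert}' is the better action, then act."
        )
        examples.append((prompt, expert))
    return examples
```

Both functions emit ordinary (prompt, target) pairs, so either strategy reduces to standard supervised fine-tuning on agent-generated data, which is what lets early experience sit between imitation learning and reward-driven reinforcement learning.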