Agent Learning via Early Experience

Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, Jian Xie, Yuxuan Sun, Boyu Gou, Qi Qi, Zihang Meng, Jianwei Yang, Ning Zhang, Xian Li, Ashish Shah, Dat Huynh, Hengduo Li, Zi Yang, Sara Cao, Lawrence Jang, Shuyan Zhou, Jiacheng Zhu, Huan Sun, Jason Weston, Yu Su, Yifan Wu

from arXiv, ICML 2026

A long-term goal of language agents is to learn and improve through their own experience, ultimately outperforming humans on complex, real-world tasks. However, training agents from experience data with reinforcement learning remains difficult in many environments, which either lack verifiable rewards (e.g., websites) or require inefficient long-horizon rollouts (e.g., multi-turn tool use). As a result, most current agents rely on supervised fine-tuning on expert data, which is hard to scale and generalizes poorly. This limitation stems from the nature of expert demonstrations: they capture only a narrow range of scenarios and expose the agent to limited environment diversity. We address this limitation with a middle-ground paradigm we call early experience: interaction data generated by the agent's own actions, where the resulting future states serve as supervision without reward signals. Within this paradigm, we study two strategies for using such data: (1) implicit world modeling, which uses collected states to ground the policy in environment dynamics; and (2) self-reflection, where the agent learns from its suboptimal actions to improve reasoning and decision-making. Evaluations across eight diverse environments and multiple model families show that our approaches consistently improve effectiveness and out-of-domain generalization, highlighting the value of early experience. Moreover, in environments with verifiable rewards, our results provide promising signals that early experience offers a strong foundation for subsequent reinforcement learning, making it a practical bridge between imitation learning and fully experience-driven agents.
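To make the two strategies concrete, the following is a minimal Python sketch of how reward-free training examples might be assembled from early experience. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify data formats, so the names `Transition`, `collect_early_experience`, `world_modeling_examples`, `self_reflection_examples`, and `toy_step` are all hypothetical, the prompt/target layout is invented, a fixed candidate-action list stands in for actions proposed by the agent itself, and a string template stands in for whatever reasoning the agent would actually produce.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class Transition:
    """One reward-free interaction: the state, the action tried, and the state that followed."""
    state: str
    action: str
    next_state: str


def collect_early_experience(
    expert_pairs: List[Tuple[str, str]],
    candidate_actions: List[str],
    step: Callable[[str, str], str],
    k: int = 2,
) -> List[Transition]:
    """At each expert state, try up to k non-expert actions and record the resulting states.

    Assumption: candidate_actions stands in for actions sampled from the agent's own policy.
    """
    rollouts = []
    for state, expert_action in expert_pairs:
        alternatives = [a for a in candidate_actions if a != expert_action][:k]
        for action in alternatives:
            rollouts.append(Transition(state, action, step(state, action)))
    return rollouts


def world_modeling_examples(rollouts: List[Transition]) -> List[Dict[str, str]]:
    """Implicit world modeling (assumed format): predict the next state from state and action."""
    return [
        {
            "prompt": f"State: {t.state}\nAction: {t.action}\nNext state:",
            "target": t.next_state,
        }
        for t in rollouts
    ]


def self_reflection_examples(
    expert_pairs: List[Tuple[str, str]],
    rollouts: List[Transition],
    step: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Self-reflection (assumed format): contrast observed outcomes, then emit the expert action.

    Assumption: a fixed template replaces reasoning text generated by the agent.
    """
    examples = []
    for state, expert_action in expert_pairs:
        expert_next = step(state, expert_action)
        observed = [t for t in rollouts if t.state == state]
        notes = "; ".join(f"{t.action} led to {t.next_state}" for t in observed)
        reflection = (
            f"Alternatives tried: {notes}. "
            f"The expert action {expert_action} led to {expert_next}."
        )
        examples.append(
            {"prompt": f"State: {state}", "target": f"{reflection}\nAction: {expert_action}"}
        )
    return examples


def toy_step(state: str, action: str) -> str:
    """Hypothetical deterministic environment used only for this illustration."""
    outcomes = {
        ("search page", "click_search"): "results page",
        ("search page", "click_logout"): "login page",
        ("search page", "click_help"): "help page",
    }
    # Unknown (state, action) pairs leave the state unchanged.
    return outcomes.get((state, action), state)


expert = [("search page", "click_search")]
actions = ["click_search", "click_logout", "click_help"]
experience = collect_early_experience(expert, actions, toy_step, k=2)
wm_data = world_modeling_examples(experience)
sr_data = self_reflection_examples(expert, experience, toy_step)
```

In this sketch, neither dataset contains a reward: the supervision comes entirely from the observed next states, which is the property the abstract attributes to early experience. Both example lists could then be mixed with the expert demonstrations for ordinary supervised fine-tuning, though the mixing scheme is likewise an assumption here.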
