We envision continually learning agentic systems that become more useful over time: as they encounter sequences of related tasks, they should infer the hidden structure shared across those tasks and use it to improve future decisions. This cross-task experiential learning capability is pivotal in domains such as personalization and interactive assistance, but existing training/evaluation frameworks do not provide shared, controllable latent structures and cannot measure whether or why agents improve. We introduce LatentGym: a controllable suite in which each environment is organized around a ground-truth latent variable governing the structure across tasks. Our construction yields metrics that separate exploration (whether the agent's actions gather information about the latent) from exploitation (whether the agent uses what it has gathered). We demonstrate our suite on empirical studies addressing three questions: how and why frontier models fail to adapt across related tasks; whether post-training on related task sequences improves general cross-task adaptation, and where those gains come from; and how design choices such as inter-task feedback shape training dynamics and generalization. Together, these results establish a controlled foundation for studying how LLM agents learn from experience across tasks, and for designing agents that adapt more reliably in sequential, personalized, and interactive settings.