A long-standing goal in AI is to build agents that can solve a variety of tasks across different environments, including previously unseen ones. Two dominant approaches tackle this challenge: (i) reinforcement learning (RL), which learns policies through trial and error, and (ii) optimal control, which plans actions using a learned or known dynamics model. However, their relative strengths and weaknesses remain underexplored in the setting where agents must learn from offline trajectories without reward annotations. In this work, we systematically analyze the performance of different RL and control-based methods on datasets of varying quality. On the RL side, we consider goal-conditioned and zero-shot approaches. On the control side, we train a latent dynamics model using the Joint Embedding Predictive Architecture (JEPA) and use it for planning. We study how dataset properties, such as data diversity, trajectory quality, and environment variability, affect the performance of these approaches. Our results show that model-free RL excels when abundant, high-quality data is available, while model-based planning excels at generalization to novel environment layouts, trajectory stitching, and data efficiency. Notably, planning with a latent dynamics model emerges as a promising approach for zero-shot generalization from suboptimal data.
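The control-side recipe, plan actions by rolling out a learned latent dynamics model toward a goal, can be sketched as follows. This is a minimal illustration only: `latent_dynamics` is a hypothetical toy stand-in for a JEPA-trained predictor, and random shooting stands in for whatever planner the paper actually uses.

```python
import numpy as np

def latent_dynamics(z, a):
    """Toy stand-in for a learned latent predictor: next latent state given action."""
    return z + 0.1 * a  # simple integrator dynamics in latent space

def plan_random_shooting(z0, z_goal, horizon=10, n_samples=256, seed=0):
    """Sample candidate action sequences, roll each out through the latent
    model, and keep the one whose predicted endpoint lands nearest the goal."""
    rng = np.random.default_rng(seed)
    action_dim = z0.shape[0]
    candidates = rng.uniform(-1.0, 1.0, size=(n_samples, horizon, action_dim))
    best_cost, best_seq = np.inf, None
    for seq in candidates:
        z = z0.copy()
        for a in seq:
            z = latent_dynamics(z, a)
        cost = np.linalg.norm(z - z_goal)  # goal distance in latent space
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

z0 = np.zeros(2)
z_goal = np.array([0.5, -0.5])
seq, cost = plan_random_shooting(z0, z_goal)
```

Because planning only needs the model's predictions, not reward labels, this setup matches the paper's offline, reward-free setting: goals are specified at test time and the model is reused zero-shot.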