A long-standing goal of reinforcement learning is to acquire agents that can learn on training tasks and generalize well to unseen tasks that may share similar dynamics but have different reward functions. The ability to generalize across tasks is important, as it determines an agent's adaptability to real-world scenarios where reward mechanisms might vary. In this work, we first show that training a general world model can exploit similar structures across these tasks and help train more generalizable agents. Extending world models to the task-generalization setting, we introduce a novel method named Task Aware Dreamer (TAD), which integrates reward-informed features to identify latent characteristics that are consistent across tasks. Within TAD, we compute the variational lower bound of the sample-data log-likelihood, which introduces a new term designed to differentiate tasks via their states, as the optimization objective of our reward-informed world models. To demonstrate the advantages of the reward-informed policy in TAD, we introduce a new metric called Task Distribution Relevance (TDR), which quantitatively measures the relevance of different tasks. For task distributions with high TDR, i.e., where tasks differ significantly, we show that Markovian policies struggle to distinguish them, making the reward-informed policies in TAD necessary. Extensive experiments on both image-based and state-based tasks show that TAD can significantly improve performance when handling different tasks simultaneously, especially those with high TDR, and exhibits strong generalization to unseen tasks.