A long-standing goal of reinforcement learning is to build agents that learn on training tasks and generalize well to unseen tasks that share similar dynamics but have different reward functions. A central challenge is to quantitatively measure the similarity between these tasks, which is vital for analyzing the task distribution and for designing algorithms that generalize better. To address this, we present a novel metric named Task Distribution Relevance (TDR), defined via the optimal Q-functions of different tasks, to capture the relevance of the task distribution quantitatively. For tasks with high TDR, i.e., tasks that differ significantly, we show that Markovian policies cannot distinguish them, leading to poor performance. Based on this insight, we condition policies on all historical information so that they can distinguish different tasks, and we propose Task Aware Dreamer (TAD), which extends world models into reward-informed world models that capture invariant latent features across tasks. In TAD, we derive the corresponding variational lower bound on the data log-likelihood, which includes a novel term for distinguishing tasks via states, and use it to optimize the reward-informed world models. Extensive experiments on both image-based and state-based control tasks demonstrate that TAD significantly improves performance when handling different tasks simultaneously, especially those with high TDR, and generalizes strongly to unseen tasks.
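The claim that Markovian policies cannot distinguish strongly differing tasks can be illustrated with a toy example. The sketch below is not taken from the paper and does not compute TDR, which the abstract does not define. It assumes a hypothetical single-state environment with two tasks that share identical dynamics but reward opposite actions; every name in it (`reward`, `rollout`, `history_policy`, and so on) is invented for illustration. A policy that ignores history earns the same average return of half the horizon whichever action it picks (the same holds in expectation for any fixed action distribution), whereas a policy that reads the first observed reward identifies the task after one step.

```python
from typing import Callable, List, Tuple

LEFT, RIGHT = 0, 1
History = List[Tuple[int, float]]  # (action, reward) pairs observed so far
Policy = Callable[[History], int]


def reward(task: int, action: int) -> float:
    """Toy assumption: task 0 pays for LEFT, task 1 pays for RIGHT."""
    return 1.0 if action == task else 0.0


def rollout(policy: Policy, task: int, horizon: int) -> float:
    """Run one episode in the single-state environment and return the total reward."""
    history: History = []
    total = 0.0
    for _ in range(horizon):
        action = policy(history)
        r = reward(task, action)
        history.append((action, r))
        total += r
    return total


def markov_left(history: History) -> int:
    """Markovian policy: ignores history, always picks LEFT."""
    return LEFT


def markov_right(history: History) -> int:
    """Markovian policy: ignores history, always picks RIGHT."""
    return RIGHT


def history_policy(history: History) -> int:
    """History-dependent policy: probe LEFT once, then follow the observed reward."""
    if not history:
        return LEFT
    first_action, first_reward = history[0]
    return first_action if first_reward > 0 else 1 - first_action


def mean_return(policy: Policy, horizon: int = 10) -> float:
    """Average return over the two equally weighted tasks."""
    return sum(rollout(policy, task, horizon) for task in (0, 1)) / 2


if __name__ == "__main__":
    for name, policy in [("markov_left", markov_left),
                         ("markov_right", markov_right),
                         ("history_policy", history_policy)]:
        print(name, mean_return(policy))
```

With a horizon of 10, both memoryless policies average 5.0 across the two tasks, while the history-dependent policy averages 9.5, losing a single step of reward on the task where its initial probe is wrong.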