A long-standing goal of reinforcement learning is for algorithms to learn on training tasks and generalize well to unseen tasks, as humans do, where the tasks share similar dynamics but have different reward functions. A general challenge is that it is nontrivial to quantitatively measure the similarity between such tasks, which is vital for analyzing the task distribution and for designing algorithms that generalize better. To address this, we present a novel metric, Task Distribution Relevance (TDR), which uses optimal Q functions to quantify the relevance of a task distribution. When the TDR is high, i.e., the tasks differ significantly, we show that Markovian policies cannot distinguish the tasks and therefore perform poorly. Based on this observation, we propose Reward Informed Dreamer (RID), a framework built on reward-informed world models that captures latent features invariant across tasks and encodes reward signals into policies so that they can distinguish different tasks. In RID, we compute the corresponding variational lower bound on the log-likelihood of the data, which includes a novel term for distinguishing different tasks via states, based on reward-informed world models. Finally, extensive experiments on the DeepMind Control Suite demonstrate that RID significantly improves performance when handling multiple tasks simultaneously, especially those with high TDR, and generalizes effectively to unseen tasks.
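The claim that a Markovian policy struggles when tasks differ significantly can be illustrated with a toy example. The sketch below is not the TDR definition (the abstract does not state it) and not the RID method; it is a hypothetical two-state MDP, constructed here for illustration, in which two tasks share the same dynamics but have opposite rewards. The function names `optimal_q` and `policy_value`, the discount factor, and the use of the largest gap between optimal Q functions as a crude dissimilarity signal are all assumptions.

```python
import itertools
import numpy as np

GAMMA = 0.9  # assumed discount factor

def optimal_q(P, R, gamma=GAMMA, iters=1000):
    """Value iteration. P has shape (S, A, S), R has shape (S, A)."""
    Q = np.zeros(R.shape)
    for _ in range(iters):
        Q = R + gamma * (P @ Q.max(axis=1))
    return Q

def policy_value(P, R, pi, gamma=GAMMA):
    """Exact state values of a deterministic Markovian policy pi (one action per state)."""
    n_states = R.shape[0]
    idx = np.arange(n_states)
    P_pi = P[idx, pi]  # (S, S) transitions under pi
    R_pi = R[idx, pi]  # (S,) rewards under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Shared dynamics: action a moves deterministically to state a.
P = np.zeros((2, 2, 2))
P[:, 0, 0] = 1.0
P[:, 1, 1] = 1.0

# Two hypothetical tasks with opposite rewards: task A pays for action 0, task B for action 1.
R_a = np.array([[1.0, 0.0], [1.0, 0.0]])
R_b = 1.0 - R_a

q_a, q_b = optimal_q(P, R_a), optimal_q(P, R_b)

# Crude dissimilarity signal (an assumption, not the paper's TDR).
q_gap = np.abs(q_a - q_b).max()

# Task-averaged return when each task gets its own optimal policy.
per_task_optimum = 0.5 * (q_a.max(axis=1).mean() + q_b.max(axis=1).mean())

# Best task-averaged return of one deterministic policy that sees only the state.
best_shared = max(
    0.5 * (policy_value(P, R_a, np.array(pi)).mean()
           + policy_value(P, R_b, np.array(pi)).mean())
    for pi in itertools.product(range(2), repeat=2)
)

print(f"Q gap: {q_gap:.3f}")
print(f"per-task optimum: {per_task_optimum:.3f}, best shared Markovian policy: {best_shared:.3f}")
```

In this toy setting the two rewards sum to 1 for every state-action pair, so any policy that conditions on the state alone earns the same task-averaged return, about half of what task-specific policies achieve. A policy that also observes rewards could tell the tasks apart, which is the intuition behind encoding reward signals into the policy.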