Transfer in reinforcement learning aims to improve learning performance on target tasks by using knowledge from previously experienced source tasks. Successor Representations (SR) and their extension, Successor Features (SF), are prominent transfer mechanisms in domains where the reward function changes between tasks. They transfer knowledge by reevaluating the expected return of previously learned policies in a new target task. The SF framework extended SR by linearly decomposing rewards into successor features and a reward weight vector, which allows its application to high-dimensional tasks. However, this came at the cost of assuming a linear relationship between reward functions and successor features, limiting its application to tasks where such a relationship exists. We propose a novel formulation of SR based on learning the cumulative discounted probability of successor features, called Successor Feature Representations (SFR). Crucially, SFR allows the expected return of policies to be reevaluated for general reward functions. We introduce different SFR variations, prove its convergence, and provide a guarantee on its transfer performance. Experimental evaluations based on SFR with function approximation demonstrate its advantage over SF not only for general reward functions, but also for linearly decomposable reward functions.
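The difference between reevaluating a policy with SF and with SFR can be illustrated on a toy problem. The following is a minimal tabular sketch, not the method or code from the paper: the 4-state cycle, the scalar feature values, the quadratic reward, and all variable names are assumptions made for illustration. It assumes a fixed policy that induces a known transition matrix and a finite set of feature values, so that the cumulative discounted probability of each feature value can be computed in closed form.

```python
import numpy as np

gamma = 0.9

# Hypothetical Markov chain induced by a fixed policy: a deterministic
# 4-state cycle 0 -> 1 -> 2 -> 3 -> 0 (assumption for illustration only).
P = np.roll(np.eye(4), 1, axis=1)

# Entering state s emits the scalar feature feature_values[feat_idx[s]].
feature_values = np.array([0.0, 1.0, 2.0])
feat_idx = np.array([0, 1, 2, 1])
F = np.eye(3)[feat_idx]              # one-hot feature indicator per state

# Discounted occupancy of successor states: sum_t gamma^t P^(t+1).
M = np.linalg.solve(np.eye(4) - gamma * P, P)

# SFR-style quantity: cumulative discounted probability of each feature value.
xi = M @ F                           # shape (n_states, n_feature_values)

# SF-style quantity: expected cumulative discounted feature.
psi = M @ feature_values[feat_idx]   # shape (n_states,)


def reward(phi):
    """A reward that is nonlinear in the feature (assumed for this example)."""
    return (phi - 1.0) ** 2


r_feat = reward(feature_values)      # reward for each feature value
r_state = r_feat[feat_idx]           # reward received on entering each state

# Reference value of the policy, obtained by iterating the Bellman equation.
v_true = np.zeros(4)
for _ in range(2000):
    v_true = P @ (r_state + gamma * v_true)

# Reevaluation with the feature distribution: weight each feature value's
# reward by its cumulative discounted probability.
v_sfr = xi @ r_feat

# Reevaluation with SF: requires reward ~ w * phi, so fit w by least squares.
w = (feature_values @ r_feat) / (feature_values @ feature_values)
v_sf = w * psi

print("true:", v_true)
print("SFR :", v_sfr)
print("SF  :", v_sf)
```

In this sketch the feature-distribution estimate matches the reference value because the reward depends on the transition only through the feature, while the SF estimate is limited by the quality of the linear fit. With continuous features or function approximation, the sum over feature values would have to be replaced by an integral or a learned approximation, which this example does not cover.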