Cross-domain reinforcement learning (CDRL) aims to improve the data efficiency of RL by leveraging data samples collected in a source domain to facilitate learning in a similar target domain. Despite its potential, cross-domain transfer in RL faces two fundamental and intertwined challenges: (i) the source and target domains can have different state or action spaces, which makes direct transfer infeasible and calls for more sophisticated inter-domain mappings; (ii) the transferability of a source-domain model is not easily identifiable a priori, so CDRL is prone to negative transfer. In this paper, we propose to tackle these two challenges jointly through the lens of \textit{cross-domain Bellman consistency} and a \textit{hybrid critic}. Specifically, we first introduce the notion of cross-domain Bellman consistency as a measure of the transferability of a source-domain model. We then propose $Q$Avatar, which combines the Q functions from the source and target domains through an adaptive, hyperparameter-free weight function. Building on this design, we characterize the convergence behavior of $Q$Avatar and show that it achieves reliable transfer, in the sense that it effectively leverages a source-domain Q function for knowledge transfer to the target domain. Through experiments, we demonstrate that $Q$Avatar achieves favorable transferability across various RL benchmark tasks, including locomotion and robot arm manipulation. Our code is available at https://rl-bandits-lab.github.io/Cross-Domain-RL/.
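To make the idea of a hybrid critic with an adaptive weight more concrete, the following is a minimal tabular sketch. It is not the $Q$Avatar algorithm as defined in the paper: the abstract does not specify the weight function or the inter-domain mapping, so everything below is an illustrative assumption. In particular, the names `bellman_error` and `hybrid_q` are hypothetical, the source Q function is assumed to have already been mapped into the target state and action spaces (`q_src_mapped`), and the weight is assumed to be a simple ratio of squared one-step Bellman errors measured on target-domain transitions, so that the critic with the smaller error receives the larger weight.

```python
import numpy as np

def bellman_error(q, transitions, gamma=0.99):
    """Mean squared one-step Bellman residual of a tabular q on target-domain data.

    q is an array of shape [num_states, num_actions]; each transition is
    (state, action, reward, next_state, done). Assumed error measure, for illustration.
    """
    errs = []
    for s, a, r, s_next, done in transitions:
        target = r + (0.0 if done else gamma * np.max(q[s_next]))
        errs.append((q[s, a] - target) ** 2)
    return float(np.mean(errs))

def hybrid_q(q_tar, q_src_mapped, transitions, gamma=0.99, eps=1e-12):
    """Blend a target Q table with a source Q table mapped into the target domain.

    The source weight is err_tar / (err_tar + err_src): an assumed, illustrative
    rule that needs no tuned hyperparameter (eps only guards against 0/0).
    """
    err_tar = bellman_error(q_tar, transitions, gamma)
    err_src = bellman_error(q_src_mapped, transitions, gamma)
    w_src = err_tar / (err_tar + err_src + eps)
    return (1.0 - w_src) * q_tar + w_src * q_src_mapped, w_src

# Toy target domain: 2 states, 2 actions, one terminal transition with reward 1.
transitions = [(0, 0, 1.0, 1, True)]
q_consistent = np.array([[1.0, 0.0], [0.0, 0.0]])    # zero Bellman error on the data
q_inconsistent = np.array([[3.0, 0.0], [0.0, 0.0]])  # squared Bellman error of 4

# A consistent source Q dominates an inconsistent target Q, and vice versa.
_, w_helpful_source = hybrid_q(q_inconsistent, q_consistent, transitions)
_, w_harmful_source = hybrid_q(q_consistent, q_inconsistent, transitions)
```

Under these assumptions, a source Q function that is Bellman-consistent on target data is weighted heavily, while an inconsistent one is suppressed, which is one simple way a hybrid critic could guard against negative transfer.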