The dominant framework for off-policy multi-goal reinforcement learning involves estimating a goal-conditioned Q-value function. When learning to achieve multiple goals, data efficiency is intimately connected with how well the Q-function generalizes to new goals. The de facto paradigm is to approximate Q(s, a, g) using a monolithic neural network. To improve the generalization of the Q-function, we propose a bilinear decomposition that represents the Q-value via a low-rank approximation in the form of a dot product between two vector fields. The first vector field, f(s, a), captures the environment's local dynamics at the state s, whereas the second, φ(s, g), captures the global relationship between the current state and the goal. We show that our bilinear decomposition scheme substantially improves data efficiency and transfers better to out-of-distribution goals than prior methods. Empirical evidence is provided on the simulated Fetch robot task suite and on dexterous manipulation with a Shadow hand.
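As a rough illustration of the decomposition described above, the sketch below computes Q(s, a, g) as the dot product of f(s, a) and φ(s, g) using plain NumPy. The helper names (`init_mlp`, `mlp`, `bilinear_q`), the layer sizes, the tanh activation, and the random weights are all assumptions made for this example, not details taken from the paper; no training is performed.

```python
import numpy as np

def init_mlp(rng, sizes):
    """Random weights for a small MLP (hypothetical sizes, untrained)."""
    return [(rng.standard_normal((m, n)) / np.sqrt(m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(params, x):
    """Forward pass with tanh on hidden layers and a linear output layer."""
    for i, (w, b) in enumerate(params):
        x = x @ w + b
        if i < len(params) - 1:
            x = np.tanh(x)
    return x

def bilinear_q(f_params, phi_params, s, a, g):
    """Q(s, a, g) = <f(s, a), phi(s, g)>, a dot product of two embeddings."""
    f_sa = mlp(f_params, np.concatenate([s, a], axis=-1))
    phi_sg = mlp(phi_params, np.concatenate([s, g], axis=-1))
    return np.sum(f_sa * phi_sg, axis=-1)

# Illustrative dimensions (assumed, not from the paper).
state_dim, action_dim, goal_dim, embed_dim = 10, 4, 3, 3
rng = np.random.default_rng(0)
f_params = init_mlp(rng, [state_dim + action_dim, 64, embed_dim])
phi_params = init_mlp(rng, [state_dim + goal_dim, 64, embed_dim])

# One state, several candidate actions and goals.
s = rng.standard_normal(state_dim)
actions = rng.standard_normal((6, action_dim))
goals = rng.standard_normal((7, goal_dim))

# Embed every (s, a) pair and every (s, g) pair once.
F = mlp(f_params, np.concatenate(
    [np.broadcast_to(s, (6, state_dim)), actions], axis=-1))
Phi = mlp(phi_params, np.concatenate(
    [np.broadcast_to(s, (7, state_dim)), goals], axis=-1))

# Q-values for all action/goal combinations at this state.
q_table = F @ Phi.T
print(q_table.shape)                   # (6, 7)
print(np.linalg.matrix_rank(q_table))  # at most embed_dim
```

For a fixed state, the table of Q-values over actions and goals factors as `F @ Phi.T`, so its rank cannot exceed the embedding dimension. This is the sense in which the dot-product form is a low-rank approximation, in contrast to a monolithic network that takes (s, a, g) jointly.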