We study the problem of learning optimal behavior from sub-optimal datasets in goal-conditioned offline reinforcement learning under sparse rewards, invertible actions, and deterministic transitions. To mitigate the effects of \emph{distribution shift}, we propose MetricRL, a method that combines metric learning for value function approximation with weighted imitation learning for policy estimation. MetricRL avoids conservative or behavior-cloning constraints, enabling effective learning even in severely sub-optimal data regimes. We introduce \emph{distance monotonicity} as a key property linking metric representations to optimality and design an objective that explicitly promotes it. Empirically, MetricRL consistently outperforms prior state-of-the-art goal-conditioned RL methods in recovering near-optimal behavior from sub-optimal offline data.
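As a minimal illustration of how distance monotonicity connects metric representations to optimality (our reading of the abstract, not the paper's formal statement): assume deterministic dynamics $s' = f(s,a)$ and an undiscounted sparse reward of $-1$ per step until the goal is reached, so the optimal value is a negated shortest-path distance. The symbols $f$, $d^*$, and $d_\theta$ below are illustrative.
\[
V^*(s,g) \;=\; -\,d^*(s,g),
\qquad
d^*(s,g) \;=\; \min\bigl\{\, t \ge 0 : s_t = g,\ s_0 = s,\ s_{k+1} = f(s_k, a_k) \,\bigr\}.
\]
Under this reading, a learned metric $d_\theta$ is \emph{distance monotone} when it preserves the ordering of true shortest-path distances,
\[
d^*(s,g) \;<\; d^*(s',g)
\;\Longrightarrow\;
d_\theta(s,g) \;<\; d_\theta(s',g),
\]
which suffices for greedy descent on $d_\theta$ to select shortest-path actions even when $d_\theta$ is not numerically accurate.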