Inverse reinforcement learning (IRL) aims to explicitly infer an underlying reward function from collected expert demonstrations. Because expert demonstrations can be costly to obtain, current IRL techniques focus on learning a better-than-demonstrator policy using a reward function derived from sub-optimal demonstrations. However, existing IRL algorithms primarily address trajectory ranking ambiguity when learning the reward function. They overlook the degree of difference between trajectories in terms of their returns, which is essential for further removing reward ambiguity. In addition, the reward of a single transition is heavily influenced by the context within its trajectory. To address these issues, we introduce the Distance-rank Aware Sequential Reward Learning (DRASRL) framework. Unlike existing approaches, DRASRL uses both the ranking of trajectories and the degree of dissimilarity between them to jointly eliminate reward ambiguity while learning a sequence of contextually informed reward signals. Specifically, we use the distance between the policies that generated the trajectories as a measure of the degree of difference between trajectories. This distance-aware information is then used, via contrastive learning, to infer embeddings in the representation space for reward learning. Meanwhile, we integrate a pairwise ranking loss to incorporate ranking information into the latent features. Moreover, we use the Transformer architecture to capture contextual dependencies within trajectories in the latent space, leading to more accurate reward estimation. In extensive experiments, DRASRL demonstrates significant performance improvements over previous state-of-the-art (SOTA) methods.
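To make the two training signals described in the abstract more concrete, the following is a minimal sketch, not the DRASRL implementation. It assumes a Bradley-Terry style pairwise ranking loss over summed per-step rewards, a form commonly used in preference-based reward learning, and replaces the contrastive objective with a simple hypothetical distance-matching term that pulls the embedding distance of two trajectories toward a given policy distance. The function names, the `weight` parameter, and the way the two terms are combined are all illustrative assumptions. The Transformer encoder that would produce the per-step rewards and embeddings is omitted.

```python
import math


def pairwise_ranking_loss(rewards_worse, rewards_better):
    """Bradley-Terry style loss for one ranked trajectory pair.

    Each argument is a sequence of predicted per-step rewards. The loss is
    small when the better-ranked trajectory has the larger predicted return.
    """
    return_worse = sum(rewards_worse)
    return_better = sum(rewards_better)
    x = return_worse - return_better
    # -log(exp(R_b) / (exp(R_w) + exp(R_b))) = log(1 + exp(R_w - R_b)),
    # written in a numerically stable form.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def distance_matching_loss(z_a, z_b, policy_distance):
    """Hypothetical stand-in for the distance-aware term (an assumption).

    Penalizes the gap between the Euclidean distance of two trajectory
    embeddings and the distance between the policies that generated them.
    """
    embedding_distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(z_a, z_b)))
    return (embedding_distance - policy_distance) ** 2


def combined_loss(rewards_worse, rewards_better, z_worse, z_better,
                  policy_distance, weight=1.0):
    """Ranking loss plus a weighted distance term (weighting is an assumption)."""
    ranking = pairwise_ranking_loss(rewards_worse, rewards_better)
    distance = distance_matching_loss(z_worse, z_better, policy_distance)
    return ranking + weight * distance
```

In this sketch the ranking term only encodes which trajectory is better, while the distance term encodes how far apart the two trajectories should sit in the latent space. This mirrors the abstract's point that ranking alone leaves the degree of difference unspecified.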