In preference-based reinforcement learning (PbRL), a reward function is learned from human feedback given in the form of preferences. To expedite preference collection, recent works have leveraged \emph{offline preferences}, which are preferences collected over offline data. In this setting, the reward function is fitted on the offline data, so if the learning agent exhibits behaviors that do not overlap with that data, the learned reward function may generalize poorly. To address this problem, the present study introduces a PbRL framework that combines offline preferences with \emph{virtual preferences}, which are comparisons between the agent's behaviors and the offline data. Critically, the virtual preferences let the reward function track the agent's behaviors, thereby offering well-aligned guidance to the agent. Experiments on continuous control tasks demonstrate the effectiveness of incorporating virtual preferences in PbRL.
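To make the idea of fitting one reward function on both kinds of preferences concrete, the following is a minimal sketch and not the study's implementation. It assumes a Bradley-Terry preference model, which is common in PbRL but is not stated in the abstract. The names `preference_loss`, `combined_loss`, and `virtual_weight` are hypothetical, and the labels of the virtual preferences (which side of each comparison is preferred) are supplied by the caller, because the abstract does not specify how they are assigned.

```python
import math


def _softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def segment_return(reward_fn, segment):
    """Sum the learned reward over the (state, action) pairs of one segment."""
    return sum(reward_fn(s, a) for s, a in segment)


def preference_loss(reward_fn, seg_a, seg_b, label):
    """Cross-entropy under an assumed Bradley-Terry model.

    label is 1.0 if seg_a is preferred and 0.0 if seg_b is preferred.
    """
    diff = segment_return(reward_fn, seg_a) - segment_return(reward_fn, seg_b)
    # -log sigmoid(diff) = softplus(-diff); -log(1 - sigmoid(diff)) = softplus(diff)
    return label * _softplus(-diff) + (1.0 - label) * _softplus(diff)


def _mean_loss(reward_fn, prefs):
    """Average preference loss over (seg_a, seg_b, label) triples; 0.0 if empty."""
    if not prefs:
        return 0.0
    return sum(preference_loss(reward_fn, a, b, y) for a, b, y in prefs) / len(prefs)


def combined_loss(reward_fn, offline_prefs, virtual_prefs, virtual_weight=1.0):
    """Offline-preference loss plus a weighted virtual-preference loss.

    virtual_prefs pairs an agent segment with an offline segment; the
    weighting scheme (virtual_weight) is an illustrative assumption.
    """
    return _mean_loss(reward_fn, offline_prefs) + virtual_weight * _mean_loss(
        reward_fn, virtual_prefs
    )
```

In such a setup, the virtual preferences would be regenerated from the agent's recent rollouts as training proceeds, so that the reward function keeps being fitted on behaviors the agent actually exhibits rather than on the offline data alone.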