Preference-based Reinforcement Learning (PbRL) methods use binary feedback from a human in the loop (HiL) over queried trajectory pairs to learn a reward model that approximates the human's underlying reward function, which captures their preferences. In this work, we investigate the high variability of initialized reward models, which are sensitive to the random seed of the experiment. This sensitivity compounds the problem of degenerate reward functions that PbRL methods already suffer from. We propose a data-driven reward initialization method that adds no cost for the human in the loop and negligible cost for the PbRL agent. We show that it makes the predicted rewards of the initialized reward model uniform across the state space, which reduces the variability in performance across multiple runs and improves overall performance compared to other initialization methods.
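The idea of seed-sensitive initial rewards, and of an initialization that flattens them, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from this work: the two-layer NumPy reward model, the function names `init_reward_model`, `predict_reward`, and `flatten_initial_rewards`, the constant target of 0, and the use of plain gradient descent on the output layer over a batch of sampled states. It only shows one plausible way to push the predictions of a randomly initialized reward model toward a uniform value without any human feedback.

```python
import numpy as np


def init_reward_model(seed, state_dim=4, hidden=32):
    """Randomly initialize a small two-layer reward model (hypothetical architecture)."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, 1.0, size=(state_dim, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0.0, 1.0, size=hidden),
        "b2": 0.0,
    }


def predict_reward(params, states):
    """Predicted reward for each state (one scalar per row of `states`)."""
    h = np.tanh(states @ params["W1"] + params["b1"])
    return h @ params["w2"] + params["b2"]


def flatten_initial_rewards(params, states, target=0.0, lr=0.02, steps=2000):
    """Assumed initialization step: regress the output layer toward a constant.

    Uses only sampled states (no preference labels), so it costs the human nothing.
    Returns a new parameter dict; the input dict is left unchanged.
    """
    h = np.tanh(states @ params["W1"] + params["b1"])  # hidden features stay fixed
    w2, b2 = params["w2"].copy(), params["b2"]
    n = len(states)
    for _ in range(steps):
        err = h @ w2 + b2 - target        # deviation from the constant target
        w2 -= lr * (h.T @ err) / n        # gradient of 0.5 * mean squared error
        b2 -= lr * err.mean()
    new_params = dict(params)
    new_params["w2"], new_params["b2"] = w2, b2
    return new_params


if __name__ == "__main__":
    # Stand-in for states collected by the agent; here drawn from a Gaussian.
    states = np.random.default_rng(0).normal(size=(256, 4))
    for seed in range(5):
        p0 = init_reward_model(seed)
        p1 = flatten_initial_rewards(p0, states)
        print(
            f"seed={seed} "
            f"reward std before={predict_reward(p0, states).std():.3f} "
            f"after={predict_reward(p1, states).std():.3f}"
        )
```

Running the script prints, for each seed, the spread of predicted rewards over the sampled states before and after the flattening step. In this toy setup the spread before flattening differs from seed to seed, while after flattening it is small for every seed, which is the qualitative effect described above.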