This paper develops a novel rating-based reinforcement learning approach that uses human ratings as the source of human guidance in reinforcement learning. Unlike existing preference-based and ranking-based reinforcement learning paradigms, which rely on relative human preferences over sample pairs, the proposed approach relies on human evaluation of individual trajectories, with no relative comparison between sample pairs. It builds on a new prediction model for human ratings and a novel multi-class loss function. We conduct several experimental studies with both synthetic ratings and real human ratings to evaluate the effectiveness and benefits of the proposed approach.