Preference-based reinforcement learning (PbRL) promises to learn a complex reward function from binary human preferences. However, this human-in-the-loop formulation requires considerable human effort to assign preference labels to segment pairs, which hinders large-scale application. Recent approaches have tried to reuse unlabeled segments, which implicitly elucidates the distribution of segments and thereby reduces the human effort, and consistency regularization has further been considered to improve the performance of semi-supervised learning. However, we notice that, unlike in general classification tasks, PbRL exhibits a unique phenomenon that we define in this paper as the similarity trap. Intuitively, humans can have diametrically opposite preferences for similar segment pairs, and this similarity can cause consistency regularization to fail in PbRL. Because of the similarity trap, consistency regularization improperly increases the consistency of the model's predictions between segment pairs and thus reduces the confidence in reward learning, since the augmented distribution does not match the original one in PbRL. To overcome this issue, we present a self-training method together with our proposed peer regularization, which penalizes the reward model for memorizing uninformative labels and yields confident predictions. Empirically, we demonstrate that our approach learns a variety of locomotion and robotic manipulation behaviors well when combining different semi-supervised alternatives with peer regularization.