Preference-based reinforcement learning (PbRL) is emerging as a promising approach to teaching robots through human comparative feedback, sidestepping the need for complex reward engineering. However, the substantial volume of feedback required by existing PbRL methods often leads to reliance on synthetic feedback generated by scripted teachers. This approach reintroduces the need for intricate reward engineering and struggles to adapt to the nuanced preferences particular to human-robot interaction (HRI) scenarios, where users may have unique expectations toward the same task. To address these challenges, we introduce PrefCLM, a novel framework that utilizes crowdsourced large language models (LLMs) as simulated teachers in PbRL. We utilize Dempster-Shafer Theory to fuse individual preferences from multiple LLM agents at the score level, efficiently leveraging their diversity and collective intelligence. We also introduce a human-in-the-loop pipeline that facilitates collective refinements based on user interactive feedback. Experimental results across various general RL tasks show that PrefCLM achieves competitive performance compared to traditional scripted teachers and excels in facilitating more natural and efficient behaviors. A real-world user study (N=10) further demonstrates its capability to tailor robot behaviors to individual user preferences, significantly enhancing user satisfaction in HRI scenarios.
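To make the score-level fusion concrete, below is a minimal sketch of Dempster's rule of combination applied to pairwise preference judgments from multiple LLM agents. It assumes each agent's output is normalized into a basic probability assignment over the frame {prefer segment sigma_0, prefer segment sigma_1, uncertain}; the three-element frame, the function names, and the example masses are illustrative assumptions, not the paper's actual implementation.

```python
from functools import reduce

# Frame of discernment for one preference query: the teacher prefers
# trajectory segment sigma_0, prefers sigma_1, or is uncertain (Theta).
# Each LLM agent's judgment is a basic probability assignment (BPA)
# m = (m_0, m_1, m_theta) with m_0 + m_1 + m_theta == 1.
# (Hypothetical interface; the paper's frame and scoring may differ.)

def dempster_combine(m_a, m_b):
    """Fuse two BPAs over {sigma_0, sigma_1, Theta} via Dempster's rule."""
    a0, a1, at = m_a
    b0, b1, bt = m_b
    # Conflict mass K: agents backing opposite trajectory segments.
    k = a0 * b1 + a1 * b0
    if k >= 1.0:
        raise ValueError("Total conflict: Dempster's rule is undefined.")
    norm = 1.0 - k
    m0 = (a0 * b0 + a0 * bt + at * b0) / norm
    m1 = (a1 * b1 + a1 * bt + at * b1) / norm
    mt = (at * bt) / norm
    return (m0, m1, mt)

def fuse_crowd(bpas):
    """Fold Dempster's rule over the BPAs of all crowdsourced LLM agents."""
    return reduce(dempster_combine, bpas)

# Example: three agents; two lean toward sigma_1, one is mostly uncertain.
agents = [(0.2, 0.7, 0.1), (0.1, 0.8, 0.1), (0.3, 0.3, 0.4)]
fused = fuse_crowd(agents)
label = 1.0 if fused[1] > fused[0] else 0.0  # preference label for PbRL
print(f"fused masses: {fused}, preference label: {label}")
```

One appeal of this kind of fusion over simple averaging is that the explicit uncertainty mass lets a hesitant agent defer to confident peers instead of diluting the consensus, which is consistent with the framework's goal of leveraging the diversity of the LLM crowd.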