Preference-based reinforcement learning (RL) has emerged as a new field in robot learning, where humans play a pivotal role in shaping robot behavior by expressing preferences over different sequences of state-action pairs. However, learning realistic robot policies requires humans to answer a large number of queries. In this work, we address this sample-efficiency challenge by expanding the information collected per query to include both a preference and an optional text prompt. To accomplish this, we leverage the zero-shot capabilities of a large language model (LLM) to reason from the text provided by humans. To accommodate the additional query information, we reformulate the reward learning objectives to incorporate flexible highlights: state-action pairs that carry relatively high information and are related to the features extracted in a zero-shot fashion by a pretrained LLM. In both a simulated scenario and a user study, we demonstrate the effectiveness of our approach by analyzing the feedback and its implications. Additionally, the collected feedback is used to train a robot to follow socially compliant trajectories in a simulated social navigation environment. Video examples of the trained policies are available at https://sites.google.com/view/rl-predilect
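
For readers unfamiliar with preference-based reward learning, the following is a minimal sketch of the standard Bradley-Terry preference loss that such methods commonly build on. It is not the reformulated objective described above: the abstract does not define how highlights enter the loss, so no highlight term is shown. All function names are hypothetical and chosen for illustration only.

```python
import math
from typing import Sequence


def segment_return(rewards: Sequence[float]) -> float:
    """Sum of predicted per-step rewards over one state-action segment."""
    return float(sum(rewards))


def preference_probability(rewards_a: Sequence[float],
                           rewards_b: Sequence[float]) -> float:
    """Bradley-Terry probability that segment A is preferred over segment B."""
    diff = segment_return(rewards_a) - segment_return(rewards_b)
    # Numerically stable logistic function of the return difference.
    if diff >= 0:
        return 1.0 / (1.0 + math.exp(-diff))
    e = math.exp(diff)
    return e / (1.0 + e)


def preference_loss(rewards_a: Sequence[float],
                    rewards_b: Sequence[float],
                    label: float) -> float:
    """Cross-entropy between the human label and the predicted preference.

    label is 1.0 if the human preferred A, 0.0 if B, and 0.5 for a tie
    (an assumed labeling convention, not taken from the abstract).
    """
    p = preference_probability(rewards_a, rewards_b)
    eps = 1e-12  # guards against log(0)
    return -(label * math.log(p + eps)
             + (1.0 - label) * math.log(1.0 - p + eps))
```

In a full pipeline, the per-step rewards would come from a learned reward model, and this loss would be minimized over many labeled segment pairs. Here they are plain lists of floats so the sketch stays self-contained.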