Offline reinforcement learning enables policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL methods assume the dataset is already labeled with task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy from the reward-labeled dataset using offline RL. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task: we first learn a reward function with a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning (IQL) to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.
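To make the pipeline concrete, below is a minimal sketch of the reward-labeling stage described above; it is not the authors' released code. It assumes a hypothetical `query_vlm_preference(task_text, img_a, img_b)` helper that asks a vision-language model which of two observation images shows more task progress (returning 0 or 1), and a hypothetical `dataset.sample_pairs` sampler over the offline data. The reward model is trained with the standard Bradley-Terry preference loss, after which the dataset is relabeled and handed to an off-the-shelf offline RL algorithm such as IQL.

```python
# Sketch: VLM preference-based reward labeling for an offline dataset.
# query_vlm_preference() and dataset.sample_pairs() are assumed helpers,
# named here only for illustration.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Small MLP mapping an observation feature vector to a scalar reward."""
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def preference_loss(r_a, r_b, pref):
    """Bradley-Terry loss; pref=1 means observation A is preferred."""
    logits = torch.stack([r_b, r_a], dim=-1)  # class index 1 <-> A preferred
    return nn.functional.cross_entropy(logits, pref)

def train_reward(model, dataset, task_text, query_vlm_preference,
                 steps=10_000, batch=64, lr=3e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        # Sample observation pairs and query the VLM for a preference label.
        obs_a, img_a, obs_b, img_b = dataset.sample_pairs(batch)
        pref = torch.tensor([query_vlm_preference(task_text, ia, ib)
                             for ia, ib in zip(img_a, img_b)])
        loss = preference_loss(model(obs_a), model(obs_b), pref)
        opt.zero_grad(); loss.backward(); opt.step()
    return model

@torch.no_grad()
def relabel(model, dataset):
    """Overwrite dataset rewards with learned rewards before offline RL (e.g., IQL)."""
    dataset.rewards = model(dataset.observations)
    return dataset
```

Taking a cross-entropy over the stacked logits `[r_b, r_a]` is equivalent to the Bradley-Terry model, where the probability that A is preferred is `exp(r_a) / (exp(r_a) + exp(r_b))`, the same formulation used in standard preference-based RL.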