Visuomotor robot policies, increasingly pre-trained on large-scale datasets, promise significant advancements across robotics domains. However, aligning these policies with end-user preferences remains a challenge, particularly when the preferences are hard to specify. While reinforcement learning from human feedback (RLHF) has become the predominant mechanism for alignment in non-embodied domains like large language models, it has not seen the same success in aligning visuomotor policies due to the prohibitive amount of human feedback required to learn visual reward functions. To address this limitation, we propose Representation-Aligned Preference-based Learning (RAPL), an observation-only method for learning visual rewards from significantly less human preference feedback. Unlike traditional RLHF, RAPL focuses human feedback on fine-tuning pre-trained vision encoders to align with the end-user's visual representation and then constructs a dense visual reward via feature matching in this aligned representation space. We first validate RAPL through simulation experiments in the X-Magical benchmark and Franka Panda robotic manipulation, demonstrating that it learns rewards aligned with human preferences, uses preference data more efficiently, and generalizes across robot embodiments. Finally, in hardware experiments we align pre-trained Diffusion Policies for three object manipulation tasks. We find that RAPL can fine-tune these policies with 5× less real human preference data, taking the first step towards minimizing human feedback while maximizing visuomotor robot policy alignment.
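To make the reward-construction step concrete, here is a minimal sketch of a dense reward built by feature matching against expert-demonstration features in an (already aligned) representation space. The function name, the nearest-neighbor matching rule, and the use of plain Euclidean distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np


def feature_matching_reward(obs_features, demo_features):
    """Dense reward: negative distance from the current observation's
    encoder features to the closest expert-demonstration frame.

    obs_features:  (d,) feature vector from the aligned vision encoder
    demo_features: (T, d) per-frame features of an expert demonstration

    Note: a simple nearest-neighbor match in feature space is assumed
    here for illustration; the actual matching scheme may differ.
    """
    dists = np.linalg.norm(demo_features - obs_features, axis=1)
    return -float(dists.min())


# Toy usage with random vectors standing in for encoder outputs.
rng = np.random.default_rng(0)
demo = rng.normal(size=(50, 16))         # 50 demo frames, 16-dim features
obs_on_path = demo[20] + 0.01            # observation near the demo
obs_off_path = rng.normal(size=16) * 5.0 # observation far from the demo

r_on = feature_matching_reward(obs_on_path, demo)
r_off = feature_matching_reward(obs_off_path, demo)
```

Because the reward is defined at every observation (not only at task completion), it provides the dense signal needed for downstream policy optimization; its quality hinges entirely on how well the encoder's feature space reflects the end-user's preferences.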