Generating human-like behavior on robots is a major challenge, especially in dexterous manipulation tasks with robotic hands. Scripting policies from scratch is intractable because of the high-dimensional control space, and training policies with reinforcement learning (RL) and manual reward engineering is also difficult and can lead to unnatural motions. Building on recent progress in RL from human feedback, we propose a framework that learns a universal human prior from direct human preference feedback over videos and uses it to efficiently tune RL policies on 20 dual-hand robot manipulation tasks in simulation, without a single human demonstration. A task-agnostic reward model is trained by iteratively generating diverse policies and collecting human preferences over their trajectories; it is then used to regularize policy behavior in the fine-tuning stage. Empirically, our method produces more human-like behavior on robot hands across diverse tasks, including unseen ones, which indicates that it generalizes.
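As an illustration of how a reward model can be fit to pairwise preferences, the following is a minimal sketch and not the implementation used in this work. It assumes a Bradley-Terry preference model (a common choice in preference-based RL, not confirmed by the text above), a linear per-step reward over hypothetical trajectory features, synthetic labels standing in for human feedback, and an additive combination of task reward and learned prior during fine-tuning. All function names and the coefficient `coef` are hypothetical.

```python
import numpy as np

def trajectory_return(weights, trajectory):
    """Sum a linear per-step reward over a trajectory of shape (T, D)."""
    return float(np.sum(trajectory @ weights))

def preference_loss_and_grad(weights, preferred, rejected):
    """Bradley-Terry negative log-likelihood for one labeled pair, plus its gradient."""
    diff = trajectory_return(weights, preferred) - trajectory_return(weights, rejected)
    # P(preferred is chosen) = sigmoid(diff), so loss = -log sigmoid(diff).
    loss = float(np.logaddexp(0.0, -diff))
    prob = 0.5 * (1.0 + np.tanh(0.5 * diff))  # numerically stable sigmoid
    grad = -(1.0 - prob) * (preferred.sum(axis=0) - rejected.sum(axis=0))
    return loss, grad

def train_reward_model(pairs, dim, lr=0.1, epochs=200):
    """Full-batch gradient descent on the mean preference loss."""
    weights = np.zeros(dim)
    for _ in range(epochs):
        grad = np.zeros(dim)
        for preferred, rejected in pairs:
            grad += preference_loss_and_grad(weights, preferred, rejected)[1]
        weights -= lr * grad / len(pairs)
    return weights

def shaped_reward(task_reward, prior_reward, coef=0.5):
    """Assumed fine-tuning signal: task reward plus a weighted learned prior."""
    return task_reward + coef * prior_reward

def make_synthetic_pairs(num_pairs, steps, true_weights, seed=0):
    """Stand-in for human labels: prefer the trajectory with the higher hidden return."""
    rng = np.random.default_rng(seed)
    pairs = []
    for _ in range(num_pairs):
        a = rng.normal(size=(steps, len(true_weights)))
        b = rng.normal(size=(steps, len(true_weights)))
        if trajectory_return(true_weights, a) >= trajectory_return(true_weights, b):
            pairs.append((a, b))
        else:
            pairs.append((b, a))
    return pairs

pairs = make_synthetic_pairs(num_pairs=40, steps=5, true_weights=np.array([1.0, -1.0]))
learned = train_reward_model(pairs, dim=2)
agreement = np.mean([
    trajectory_return(learned, p) > trajectory_return(learned, r) for p, r in pairs
])
print("learned weights:", learned, "agreement with labels:", agreement)
```

In this sketch the learned reward depends only on trajectory features and not on a task identifier, which is one simple way a reward model could be made task-agnostic; the actual features and model architecture are not specified in the text above.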