Aligning robot behavior with human preferences is crucial for deploying embodied AI agents in human-centered environments. A promising solution is interactive imitation learning from human intervention, where a human expert observes the policy's execution and intervenes to provide corrective feedback. However, existing methods often fail to exploit the prior policy to facilitate learning, which limits their sample efficiency. In this work, we introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning), designed for sample-efficient policy alignment from human intervention. Instead of inferring the complete characteristics of human behavior, MEReQ infers a residual reward function that captures the discrepancy between the human expert's underlying reward function and that of the prior policy. It then employs Residual Q-Learning (RQL) to align the policy with human preferences using this residual reward function. Extensive evaluations on simulated and real-world tasks demonstrate that MEReQ achieves sample-efficient policy alignment from human intervention.
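To make the residual-reward idea concrete, below is a minimal sketch in a toy tabular maximum-entropy setting. All names (soft_q_iteration, residual_soft_q_iteration, the entropy temperature alpha, and the toy MDP) are my own assumptions, and the true residual reward is plugged in where MEReQ would instead estimate it from human-intervention data via maximum-entropy IRL; this is an illustrative sketch of the residual soft Bellman backup, not the authors' implementation.

```python
# Toy sketch of the residual-reward / Residual Q-Learning idea, under assumed
# interfaces (tabular MDP, soft Q-iteration). Not the MEReQ codebase.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 5, 3, 0.9, 1.0  # alpha: entropy temperature

# Assumed toy MDP: random transitions, a prior reward the prior policy was
# trained on, and a hidden residual reward standing in for the human preference.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r_prior = rng.normal(size=(n_states, n_actions))
r_residual = rng.normal(scale=0.5, size=(n_states, n_actions))    # stand-in for the MaxEnt-IRL estimate

def soft_q_iteration(r, iters=500):
    """Soft (maximum-entropy) Q-iteration for the tabular MDP above."""
    q = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v = alpha * np.log(np.exp(q / alpha).sum(axis=1))  # soft state value
        q = r + gamma * P @ v                              # soft Bellman backup
    return q

def soft_policy(q):
    """Boltzmann policy induced by a soft Q-function."""
    p = np.exp((q - q.max(axis=1, keepdims=True)) / alpha)
    return p / p.sum(axis=1, keepdims=True)

# 1) Prior policy: the max-entropy optimal policy for r_prior. Only the policy
#    is assumed available downstream; its reward is treated as unknown.
pi_prior = soft_policy(soft_q_iteration(r_prior))

# 2) Residual soft Q-iteration: recover the policy for r_prior + r_residual
#    using only the prior policy's probabilities and the residual reward,
#    never the prior reward itself. The backup follows from subtracting the
#    two soft Bellman equations:
#    Q_res(s,a) = r_res(s,a) + gamma * E[ alpha * log sum_a' pi_prior(a'|s') exp(Q_res(s',a')/alpha) ].
def residual_soft_q_iteration(pi_prior, r_res, iters=500):
    q_res = np.zeros((n_states, n_actions))
    for _ in range(iters):
        v_gap = alpha * np.log((pi_prior * np.exp(q_res / alpha)).sum(axis=1))
        q_res = r_res + gamma * P @ v_gap
    return q_res

q_res = residual_soft_q_iteration(pi_prior, r_residual)
pi_aligned = pi_prior * np.exp(q_res / alpha)              # pi_total ∝ pi_prior * exp(Q_res / alpha)
pi_aligned /= pi_aligned.sum(axis=1, keepdims=True)

# Sanity check (possible only in this toy setting): the aligned policy matches
# soft Q-iteration run directly on the full reward r_prior + r_residual.
pi_full = soft_policy(soft_q_iteration(r_prior + r_residual))
print(np.abs(pi_full - pi_aligned).max())                  # ~0
```

The sketch highlights the sample-efficiency argument in the abstract: the residual backup touches only the residual reward and the prior policy, so only the discrepancy with the human's preference has to be inferred from intervention data.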