Preference-aligned robot navigation in human environments is typically achieved through learning-based approaches that use user feedback or demonstrations for personalization. However, personal preferences are subject to change and may even be context-dependent. Traditional reinforcement learning (RL) approaches with static reward functions therefore often fall short in adapting to varying user preferences, inevitably reflecting the demonstrations once training is completed. This paper introduces a framework that combines multi-objective reinforcement learning (MORL) with demonstration-based learning. Our approach allows dynamic adaptation to changing user preferences without retraining: it fluently modulates between reward-defined preference objectives and the degree to which demonstration data is reflected. Through rigorous evaluations, including a sim-to-real transfer on two robots, we demonstrate our framework's capability to reflect user preferences accurately while achieving high navigational performance in terms of collision avoidance and goal pursuance.
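The run-time modulation between objectives described above can be illustrated with a minimal, hypothetical sketch of linear reward scalarization, a standard MORL technique for conditioning behavior on a preference vector. All names, objectives, and weights below are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def scalarize(objective_rewards, preference_weights):
    """Combine per-objective rewards into a single scalar reward
    using a normalized (convex) preference weighting."""
    w = np.asarray(preference_weights, dtype=float)
    w = w / w.sum()  # normalize so weights form a convex combination
    return float(np.dot(np.asarray(objective_rewards, dtype=float), w))

# Hypothetical per-step objective rewards:
# [goal progress, collision avoidance, demonstration similarity]
rewards = np.array([0.8, 1.0, 0.3])

# Shifting the preference vector at deployment time re-weights the
# trade-off between objectives without retraining the policy.
cautious = scalarize(rewards, [0.2, 0.6, 0.2])   # emphasize collision avoidance
imitative = scalarize(rewards, [0.2, 0.2, 0.6])  # emphasize demonstration reflection
```

A preference-conditioned policy would additionally receive the weight vector as part of its observation, so that a single trained network can realize the whole family of trade-offs.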