Deep reinforcement learning (RL) policies, although optimal in terms of task rewards, may not align with the personal preferences of human users. To ensure this alignment, a naive solution would be to retrain the agent using a reward function that encodes the user's specific preferences. However, such a reward function is typically not readily available, and retraining the agent from scratch can be prohibitively expensive. We propose a more practical approach: adapt the already trained policy to user-specific needs with the help of human feedback. To this end, we infer the user's intent through trajectory-level feedback and combine it with the trained task policy via a theoretically grounded dynamic policy fusion approach. Because our approach collects human feedback on the very same trajectories used to learn the task policy, it requires no additional interactions with the environment, making it a zero-shot approach. We empirically demonstrate in a number of environments that our proposed dynamic policy fusion approach consistently achieves the intended task while simultaneously adhering to user-specific needs.
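To make the core idea concrete, here is a minimal generic sketch of policy fusion: combining a frozen task policy's action values with intent values inferred from human feedback, then acting via a softmax. This is an illustration under assumed names (`q_task`, `q_intent`, `alpha`), not the paper's exact dynamic fusion formulation.

```python
# Generic policy-fusion sketch (illustrative; NOT the paper's exact method).
# q_task: action values of the pre-trained task policy.
# q_intent: action values encoding the user's inferred preference.
# alpha: assumed trade-off weight between task reward and user intent.
import math

def boltzmann(q, temp=1.0):
    """Softmax distribution over a list of action values."""
    m = max(q)  # subtract max for numerical stability
    exps = [math.exp((v - m) / temp) for v in q]
    z = sum(exps)
    return [e / z for e in exps]

def fused_policy(q_task, q_intent, alpha=0.5, temp=1.0):
    """Weighted combination of task and intent values, followed by softmax.

    Larger alpha shifts behavior toward the user's inferred preference;
    alpha=0 recovers the original task policy's action distribution.
    """
    q_fused = [(1 - alpha) * qt + alpha * qi
               for qt, qi in zip(q_task, q_intent)]
    return boltzmann(q_fused, temp)

# Example with 3 actions: the task policy prefers action 0,
# while the inferred user intent prefers action 2.
q_task = [2.0, 0.5, 1.0]
q_intent = [0.0, 0.5, 2.5]
probs = fused_policy(q_task, q_intent, alpha=0.5)
best = max(range(3), key=lambda a: probs[a])  # fused choice: action 2
```

The static weight `alpha` is the simplification here: a dynamic fusion scheme, as the abstract describes, would adjust the trade-off as a function of the state or trajectory rather than fixing it globally.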