In machine learning for sequential decision-making, an algorithmic agent learns to interact with an environment while receiving feedback in the form of a reward signal. In many unstructured real-world settings, however, this reward signal is unknown, and humans cannot reliably craft one that correctly captures the desired behavior. To solve tasks in such unstructured and open-ended environments, we present Demonstration-Inferred Preference Reinforcement Learning (DIP-RL), an algorithm that leverages human demonstrations in three distinct ways: training an autoencoder, seeding reinforcement learning (RL) training batches with demonstration data, and inferring preferences over behaviors to learn a reward function that guides RL. We evaluate DIP-RL on a tree-chopping task in Minecraft. Results suggest that the method can guide an RL agent to learn a reward function that reflects human preferences and that DIP-RL performs competitively relative to baselines. DIP-RL is inspired by our previous work on combining demonstrations and pairwise preferences in Minecraft, which was awarded a research prize at the 2022 NeurIPS MineRL BASALT competition, Learning from Human Feedback in Minecraft. Example trajectory rollouts of DIP-RL and baselines are available at https://sites.google.com/view/dip-rl.
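Two of the three uses of demonstrations named above, seeding RL training batches with demonstration data and learning a reward function from inferred preferences, can be illustrated with a minimal sketch. The abstract does not specify how preferences are inferred or how batches are mixed, so the following rests on assumptions: demonstration segments are treated as preferred over agent segments, the reward model is trained with a Bradley-Terry style cross-entropy loss, and a fixed fraction of each batch is drawn from demonstrations. The names `preference_loss`, `seeded_batch`, and `demo_fraction` are hypothetical and not taken from DIP-RL.

```python
import math
import random


def preference_loss(reward_demo, reward_agent):
    """Bradley-Terry cross-entropy loss for one pair of segments.

    Assumption: the demonstration segment is the preferred one.
    Each argument is a list of per-step rewards predicted by the reward model.
    """
    diff = sum(reward_demo) - sum(reward_agent)
    # -log(sigmoid(diff)) equals softplus(-diff); computed in a numerically stable way.
    x = -diff
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def seeded_batch(demo_buffer, agent_buffer, batch_size, demo_fraction=0.25, rng=random):
    """Build a training batch in which a fixed share comes from demonstrations.

    Assumption: demo_fraction is a hypothetical hyperparameter, not a value from the paper.
    """
    n_demo = min(int(round(batch_size * demo_fraction)), len(demo_buffer))
    n_agent = min(batch_size - n_demo, len(agent_buffer))
    return rng.sample(demo_buffer, n_demo) + rng.sample(agent_buffer, n_agent)
```

Under these assumptions, the loss is small when the reward model scores the demonstration segment above the agent segment and grows as that ordering is violated, so minimizing it pushes the learned reward toward the behavior humans demonstrated.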