In Reinforcement Learning (RL), agents have no incentive to behave predictably, and are often pushed (for example, through policy entropy regularization) to randomize their actions in favor of exploration. From a human perspective, this makes RL agents hard to interpret and predict; from a safety perspective, it makes them even harder to verify formally. We propose a novel method to induce predictable behavior in RL agents, referred to as Predictability-Aware RL (PA-RL), which employs the entropy rate of the state sequence as a predictability measure. We show how the entropy rate can be formulated as an average-reward objective and, because the resulting entropy reward function is policy-dependent, we introduce an action-dependent surrogate entropy that enables the use of policy-gradient methods. We prove that deterministic policies minimizing the average surrogate reward exist and also minimize the actual entropy rate, and we show how, given a learned dynamical model, the value function associated with the true entropy rate can be approximated. Finally, we demonstrate the effectiveness of the approach in RL tasks inspired by human-robot use cases, and show that it produces agents with more predictable behavior while achieving near-optimal rewards.
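To make the predictability measure concrete, the following is a minimal sketch of how the entropy rate of the state sequence can be computed as an average of per-state entropy "rewards" under the stationary distribution. It is not the PA-RL algorithm itself: it assumes a small tabular MDP with known transition probabilities (no learned model and no surrogate entropy), and the names `induced_chain`, `stationary_distribution`, `entropy_rate`, and the three-state ring environment are hypothetical, chosen only for illustration.

```python
import numpy as np

def induced_chain(P, pi):
    """State-to-state transition matrix under policy pi.

    P[a, x, y] = Pr(next state y | state x, action a)  (assumed known here)
    pi[x, a]   = Pr(action a | state x)
    """
    return np.einsum("xa,axy->xy", pi, P)

def stationary_distribution(T):
    """Left eigenvector of T for eigenvalue 1, normalized to sum to 1.

    Assumes the chain has a unique stationary distribution.
    """
    vals, vecs = np.linalg.eig(T.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def entropy_rate(T):
    """Entropy rate (in nats) of a stationary Markov chain with matrix T."""
    mu = stationary_distribution(T)
    with np.errstate(divide="ignore", invalid="ignore"):
        log_T = np.where(T > 0, np.log(T), 0.0)  # convention: 0 * log 0 = 0
    local_entropy = -(T * log_T).sum(axis=1)     # per-state entropy "reward"
    return float(mu @ local_entropy)             # average under mu

# Toy environment (illustrative assumption): a ring of 3 states where
# action 0 moves clockwise and action 1 moves counter-clockwise.
n = 3
P = np.zeros((2, n, n))
for x in range(n):
    P[0, x, (x + 1) % n] = 1.0
    P[1, x, (x - 1) % n] = 1.0

pi_uniform = np.full((n, 2), 0.5)             # randomizing policy
pi_det = np.tile([1.0, 0.0], (n, 1))          # always move clockwise

h_uniform = entropy_rate(induced_chain(P, pi_uniform))
h_det = entropy_rate(induced_chain(P, pi_det))
print("uniform policy entropy rate:", h_uniform)
print("deterministic policy entropy rate:", h_det)
```

In this toy setting the uniform policy yields an entropy rate of log 2 nats per step, while the deterministic policy yields zero, which illustrates why lowering the entropy rate favors predictable state trajectories. With stochastic dynamics, a deterministic policy would generally still leave a nonzero entropy rate.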