Reward Machines provide an automata-inspired structure for specifying instructions, safety constraints, and other temporally extended reward-worthy behaviour. By exposing the structure of complex reward functions, they enable counterfactual learning updates that have resulted in impressive sample efficiency gains. While Reward Machines have been employed in both tabular and deep RL settings, they have typically relied on a ground-truth interpretation of the domain-specific vocabulary that forms the building blocks of the reward function. Such ground-truth interpretations can be elusive in many real-world settings, due in part to partial observability or noisy sensing. In this paper, we explore the use of Reward Machines for Deep RL in noisy and uncertain environments. We characterize this problem as a POMDP and propose a suite of RL algorithms that leverage task structure under uncertain interpretation of the domain-specific vocabulary. Theoretical analysis exposes pitfalls in naive approaches to this problem, while experimental results show that our algorithms successfully leverage task structure to improve performance under noisy interpretations of the vocabulary. Our results provide a general framework for exploiting Reward Machines in partially observable environments.
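To make the structure referenced above concrete, the following is a minimal, illustrative sketch of a Reward Machine in Python: a finite-state machine whose transitions fire on truth assignments over a domain-specific vocabulary and emit reward, together with a per-state "counterfactual" reward query of the kind that underlies the sample-efficiency gains mentioned in the abstract. The class, the coffee-delivery task, and all identifiers are hypothetical illustrations under simplifying assumptions, not the paper's implementation.

```python
from typing import Dict, FrozenSet, Tuple

# Propositions (the domain-specific vocabulary) that hold at a given step.
Props = FrozenSet[str]


class RewardMachine:
    """A minimal Reward Machine sketch: states, labeled transitions, rewards."""

    def __init__(
        self,
        initial: str,
        # delta[(state, true_props)] -> (next_state, reward)
        delta: Dict[Tuple[str, Props], Tuple[str, float]],
    ):
        self.initial = initial
        self.delta = delta
        self.state = initial

    def step(self, true_props: Props) -> float:
        """Advance the machine on the propositions detected this step."""
        key = (self.state, true_props)
        if key in self.delta:
            self.state, reward = self.delta[key]
            return reward
        return 0.0  # no matching transition: stay in place, no reward

    def counterfactual_rewards(self, true_props: Props) -> Dict[str, float]:
        """Reward each RM state *would* have emitted on this experience.

        One environment transition thus yields a learning signal for every
        RM state, which is what enables counterfactual learning updates.
        """
        return {
            u: r
            for (u, props), (_, r) in self.delta.items()
            if props == true_props
        }


# Illustrative task: "get coffee, then deliver it to the office".
rm = RewardMachine(
    initial="u0",
    delta={
        ("u0", frozenset({"coffee"})): ("u1", 0.0),
        ("u1", frozenset({"office"})): ("u_done", 1.0),
    },
)
print(rm.step(frozenset({"coffee"})))  # 0.0, machine advances to u1
print(rm.step(frozenset({"office"})))  # 1.0, task complete
```

Note that `step` assumes a ground-truth labelling of the vocabulary; the setting studied in the paper is precisely the one where `true_props` is only available through noisy or partial observation.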