Recent works have studied *state entropy maximization* in reinforcement learning, in which the agent's objective is to learn a policy inducing high entropy over the state visitation distribution (Hazan et al., 2019). They typically assume full observability of the state of the system, so that the entropy of the observations is maximized. In practice, however, the agent may only receive *partial* observations, e.g., a robot perceiving the state of a physical space through proximity sensors and cameras. In these settings, a significant mismatch can arise between the entropy over observations and the entropy over the true states of the system. In this paper, we address the problem of maximizing entropy over the *true states* with a decision policy conditioned on partial observations *only*. This problem is a generalization of POMDPs, which is intractable in general. We develop a memory-efficient and computationally efficient *policy gradient* method to address a first-order relaxation of the objective defined on *belief* states, providing formal characterizations of the approximation gaps, the optimization landscape, and the *hallucination* problem. This paper aims to generalize state entropy maximization to more realistic domains that meet the challenges of real-world applications.
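To make the objective concrete, the sketch below illustrates the flavor of the approach on a toy tabular POMDP: a softmax policy conditioned on observations is trained with REINFORCE to maximize the entropy of the *average belief* over true states, computed by exact Bayes filtering. This is a hypothetical illustration under simplifying assumptions (known transition and emission matrices, a trajectory-level surrogate objective), not the method developed in the paper; all names, sizes, and hyperparameters are illustrative.

```python
# Minimal sketch (NOT the paper's algorithm): REINFORCE on the entropy of
# the belief-averaged state distribution in a toy tabular POMDP with known
# dynamics, so beliefs can be filtered exactly. All names are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
S, A, OBS = 4, 2, 3  # toy sizes: states, actions, observations

# Known tabular dynamics: T[a, s, s'] = P(s' | s, a), Z[s, o] = P(o | s).
T = rng.dirichlet(np.ones(S), size=(A, S))
Z = rng.dirichlet(np.ones(OBS), size=S)

theta = np.zeros((OBS, A))  # observation-conditioned softmax policy


def policy(o):
    logits = theta[o] - theta[o].max()
    p = np.exp(logits)
    return p / p.sum()


def entropy(p, eps=1e-12):
    return -np.sum(p * np.log(p + eps))


def rollout(horizon=50):
    """Run one episode; return the surrogate objective and score-function terms."""
    s = rng.integers(S)
    b = np.ones(S) / S  # uniform prior belief over the true state
    beliefs, scores = [], []
    for _ in range(horizon):
        o = rng.choice(OBS, p=Z[s])
        b = b * Z[:, o]           # Bayes correction with the new observation
        b = b / b.sum()
        beliefs.append(b)
        probs = policy(o)
        a = rng.choice(A, p=probs)
        g = np.zeros_like(theta)  # d/dtheta log pi(a | o) for a softmax policy
        g[o] = -probs
        g[o, a] += 1.0
        scores.append(g)
        s = rng.choice(S, p=T[a, s])
        b = T[a].T @ b            # Bayes prediction through the chosen action
    # First-order surrogate: entropy of the average belief, standing in for
    # the (intractable) entropy of the true state-visitation distribution.
    return entropy(np.mean(beliefs, axis=0)), scores


# REINFORCE on the trajectory-level entropy objective, with a moving baseline.
lr, baseline = 0.1, 0.0
for it in range(300):
    ret, scores = rollout()
    baseline = 0.9 * baseline + 0.1 * ret
    theta += lr * (ret - baseline) * sum(scores)

print("belief-entropy objective after training:", rollout()[0])
```

The average belief here is a tractable proxy for the true state-visitation distribution; the gap between such a proxy and the true state entropy is the kind of approximation, along with the associated *hallucination* risk, that the paper characterizes formally.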