The problem of pure exploration in Markov decision processes has been cast as maximizing the entropy of the state distribution induced by the agent's policy, an objective that has been extensively studied. However, little attention has been dedicated to state entropy maximization under partial observability, even though partial observability is ubiquitous in applications such as finance and robotics, in which the agent only receives noisy observations of the true state governing the system's dynamics. How can we address state entropy maximization in those domains? In this paper, we study the simple approach of maximizing the entropy over observations in place of the true latent states. First, we provide lower and upper bounds on the resulting approximation of the true state entropy that depend only on some properties of the observation function. Then, we show how knowledge of the observation function can be exploited to compute a principled regularization of the observation entropy that improves performance. With this work, we provide both a flexible approach for bringing advances in state entropy maximization to the POMDP setting and a theoretical characterization of its intrinsic limits.
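As a toy illustration of the gap between the two objectives, the following minimal sketch compares the entropy of a state distribution with the entropy of the observation distribution it induces. It is not taken from the paper: the three-state distribution, the two observation matrices, and the helper names `entropy` and `observation_distribution` are all hypothetical, and the observation function is assumed to be a row-stochastic matrix over a finite observation set.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution given as a list."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def observation_distribution(state_dist, obs_matrix):
    """Push a state distribution through an observation function.

    obs_matrix[s][o] is the probability of emitting observation o in state s
    (assumption: finite states and observations, rows sum to one).
    """
    n_obs = len(obs_matrix[0])
    return [
        sum(state_dist[s] * obs_matrix[s][o] for s in range(len(state_dist)))
        for o in range(n_obs)
    ]


# Hypothetical state distribution induced by some policy.
state_dist = [0.7, 0.2, 0.1]

# Noiseless observations: each state emits its own label.
identity = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]

# Noisy observations: the correct label is emitted with probability 0.8.
noisy = [
    [0.8, 0.1, 0.1],
    [0.1, 0.8, 0.1],
    [0.1, 0.1, 0.8],
]

h_state = entropy(state_dist)
h_obs_clean = entropy(observation_distribution(state_dist, identity))
h_obs_noisy = entropy(observation_distribution(state_dist, noisy))

print(f"state entropy:             {h_state:.4f}")
print(f"observation entropy clean: {h_obs_clean:.4f}")
print(f"observation entropy noisy: {h_obs_noisy:.4f}")
```

With noiseless observations the two entropies coincide, whereas in this example the noisy observation function inflates the observation entropy above the state entropy, so an agent that maximizes the former is not guaranteed to maximize the latter.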