Partially Observable Markov Decision Processes (POMDPs) model environments in which an agent cannot perceive the full state. The agent must therefore reason over its past observations and actions. However, remembering the full history is generally intractable because the history space grows exponentially with its length. A belief, that is, a probability distribution over the true state, is a sufficient statistic of the history, but computing it requires access to the model of the environment and is also intractable. Current state-of-the-art algorithms use Recurrent Neural Networks (RNNs) to compress the observation-action history, aiming to learn a sufficient statistic, but they lack guarantees of success and can lead to suboptimal policies. To overcome this, we propose the Wasserstein-Belief-Updater (WBU), an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of the approximation, ensuring that the beliefs it outputs allow for learning the optimal value function.
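
To make the notion of a belief update concrete, the following is a minimal sketch of the exact Bayesian update for a small discrete POMDP. It is not the WBU algorithm: it assumes the transition and observation models are known and enumerable, which is precisely the access that a learned approximation is meant to avoid. The function name `belief_update`, the table layouts `T[a][s][s2]` and `O[a][s2][o]`, and the two-state example are illustrative assumptions, not taken from the paper.

```python
from typing import List

Table = List[List[List[float]]]


def belief_update(belief: List[float], action: int, observation: int,
                  T: Table, O: Table) -> List[float]:
    """Exact Bayes filter for a discrete POMDP (assumed known model).

    T[a][s][s2] is the probability of moving from state s to s2 under action a.
    O[a][s2][o] is the probability of observing o after reaching s2 via action a.
    """
    n = len(belief)
    # Prediction: push the current belief through the transition model.
    predicted = [sum(T[action][s][s2] * belief[s] for s in range(n))
                 for s2 in range(n)]
    # Correction: weight each next state by the likelihood of the observation.
    unnormalized = [O[action][s2][observation] * predicted[s2]
                    for s2 in range(n)]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / z for u in unnormalized]


# Hypothetical two-state example: one action that leaves the state unchanged
# and yields a noisy observation that matches the state 85% of the time.
T = [[[1.0, 0.0], [0.0, 1.0]]]
O = [[[0.85, 0.15], [0.15, 0.85]]]

b0 = [0.5, 0.5]
b1 = belief_update(b0, action=0, observation=0, T=T, O=O)
b2 = belief_update(b1, action=0, observation=0, T=T, O=O)
print(b1, b2)
```

Each update sums over all states, so the cost is quadratic in the number of states per step. That is manageable in this toy setting but not in large or continuous state spaces, and it is unavailable altogether when the model is unknown.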