Partially Observable Markov Decision Processes (POMDPs) model environments in which an agent cannot perceive the full state. The agent must therefore reason over its past observations and actions. However, simply remembering the full history is generally intractable, because the space of histories grows exponentially. A belief, i.e., a probability distribution over the true state, is a sufficient statistic of the history, but computing it requires access to the model of the environment and is often intractable. State-of-the-art algorithms use recurrent neural networks (RNNs) to compress the observation-action history, aiming to learn a sufficient statistic, but they lack guarantees of success and can lead to sub-optimal policies. To overcome this, we propose the Wasserstein Belief Updater, a reinforcement learning (RL) algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of this approximation, ensuring that the beliefs it outputs allow for learning the optimal value function.
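For intuition, the exact belief update mentioned above can be sketched for a small tabular POMDP. This is the standard Bayes filter, not the Wasserstein Belief Updater itself. The function name `belief_update`, the table layouts `T[a][s][s']` and `O[a][s'][o]`, and the two-state example are illustrative assumptions that do not come from the paper.

```python
def belief_update(belief, action, observation, T, O):
    """Exact Bayes filter: b'(s') is proportional to O[a][s'][o] * sum_s T[a][s][s'] * b(s)."""
    n = len(belief)
    # Prediction step: push the current belief through the transition model.
    predicted = [
        sum(T[action][s][s2] * belief[s] for s in range(n)) for s2 in range(n)
    ]
    # Correction step: weight each next state by the observation likelihood.
    unnormalized = [O[action][s2][observation] * predicted[s2] for s2 in range(n)]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("observation has zero probability under the current belief")
    # Normalize so the new belief is a probability distribution.
    return [u / z for u in unnormalized]


# Hypothetical two-state POMDP with a single action (index 0) that leaves the
# state unchanged and reports it correctly with probability 0.85.
T = [[[1.0, 0.0], [0.0, 1.0]]]      # T[a][s][s']: transition probabilities
O = [[[0.85, 0.15], [0.15, 0.85]]]  # O[a][s'][o]: observation probabilities
b0 = [0.5, 0.5]                     # uniform initial belief
b1 = belief_update(b0, 0, 0, T, O)  # belief after observing o = 0
```

The update takes `T` and `O` as explicit inputs, which is the model access the abstract refers to. Each step costs time quadratic in the number of states, so this exact form is only practical for small, known models.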