In this work, we generalize the problem of learning through interaction in a POMDP by accounting for additional information that may be available at training time. First, we introduce the informed POMDP, a new learning paradigm that clearly distinguishes the information available during training from the observations available at execution. Next, we propose an objective that leverages this information to learn a sufficient statistic of the history for optimal control. We then show that this informed objective amounts to learning an environment model from which we can sample latent trajectories. Finally, we show for the Dreamer algorithm that using this informed environment model sometimes greatly improves the convergence speed of the policies across several environments. These results, together with the simplicity of the proposed adaptation, advocate for systematically considering any available additional information when learning in a POMDP with model-based RL.