Partially Observable Markov Decision Processes (POMDPs) can model complex sequential decision-making problems in stochastic and uncertain environments. A main obstacle to their broad adoption in real-world applications is the lack of a suitable POMDP model or a simulator thereof. Available solution algorithms, such as Reinforcement Learning (RL), require knowledge of the transition dynamics and the observation-generating process, which are often unknown and non-trivial to infer. In this work, we propose a combined framework for inference and robust solution of POMDPs via deep RL. First, all transition and observation model parameters are jointly inferred via Markov Chain Monte Carlo sampling of a hidden Markov model conditioned on actions, in order to recover full posterior distributions from the available data. The POMDP with uncertain parameters is then solved via deep RL, with the parameter distributions incorporated through domain randomization, to obtain solutions that are robust to model uncertainty. As a further contribution, we compare transformers and long short-term memory networks, which constitute model-free RL solutions, with a model-based/model-free hybrid approach. We apply these methods to the real-world problem of optimal maintenance planning for railway assets.
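The domain randomization step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the problem sizes, and the Dirichlet draws that stand in for real MCMC posterior samples are all assumptions made for illustration, and rewards are omitted. The idea shown is only that one parameter draw (transition and observation matrices) is picked per episode, so a policy trained on such rollouts sees the whole range of plausible models.

```python
import random


def dirichlet(rng, alphas):
    """Draw one probability vector from Dirichlet(alphas) via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]


def make_placeholder_posterior(rng, n_samples, n_states=3, n_actions=2, n_obs=3):
    """Hypothetical stand-in for MCMC output: a list of (T, O) parameter draws.

    T[a][s] is a distribution over next states given action a and state s.
    O[a][s2] is a distribution over observations given action a and next state s2.
    """
    samples = []
    for _ in range(n_samples):
        T = [[dirichlet(rng, [5.0 if s2 == s else 1.0 for s2 in range(n_states)])
              for s in range(n_states)]
             for _ in range(n_actions)]
        O = [[dirichlet(rng, [5.0 if o == s2 else 1.0 for o in range(n_obs)])
              for s2 in range(n_states)]
             for _ in range(n_actions)]
        samples.append((T, O))
    return samples


def rollout(posterior, policy, horizon, rng, initial_state=0):
    """Simulate one domain-randomized episode.

    A single (T, O) draw is selected at the start of the episode and kept
    fixed for its duration; the hidden state is never shown to the policy.
    """
    T, O = rng.choice(posterior)
    state = initial_state
    obs = None  # no observation is available before the first step
    trajectory = []
    for _ in range(horizon):
        action = policy(obs)
        next_probs = T[action][state]
        state = rng.choices(range(len(next_probs)), weights=next_probs)[0]
        obs_probs = O[action][state]
        obs = rng.choices(range(len(obs_probs)), weights=obs_probs)[0]
        trajectory.append((action, obs))
    return trajectory
```

In a training loop, `rollout` would be called once per episode with the current policy, so that each episode is generated under a different plausible model. With real data, the placeholder posterior would be replaced by the draws produced by the MCMC inference step.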