Sequential decision-making problems are often modelled as Markov decision processes (MDPs). We focus on the stochastic shortest path (SSP) problem, which is an infinite-horizon undiscounted MDP with absorbing terminal states. We develop a Bayesian framework for learning the optimal decision strategy through interactions with the decision-making task. Specifically, we learn the optimal action-value function $Q^*$, but unlike many existing Bayesian approaches, we do not rely on unrealistic modelling assumptions or ad hoc approximations. Our approach is to construct the posterior beliefs for $Q^*$ directly from Bellman's optimality equations. For deterministic rewards, we characterise the posterior as a distribution with a manifold density. To simplify inference, we relax the likelihood so that a Lebesgue density exists. The trade-off is that the relaxation introduces unidentifiability: the relaxed posterior can place significant mass on improper decision rules, whereas the exact posterior does not. We also calculate the exact posterior probabilities of optimal action selections for the tabular parametrisation of $Q^*$ with a Gaussian likelihood relaxation and a Gaussian prior, a result that is useful in benchmarking studies. Numerical studies on variants of the Deep Sea benchmark support our findings. We demonstrate that our framework faithfully quantifies uncertainty and is more data efficient than other temporal-difference-based Bayesian methodologies. We conclude with recommendations for future work.
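For orientation, the Bellman optimality equations referred to above can be written in their standard textbook form for an undiscounted SSP. The notation here is an assumption and is not taken from the paper: $\mathcal{T}$ denotes the set of absorbing terminal states, $r$ the reward received on taking action $a$ in state $s$, and $s'$ the successor state. A reward-maximising convention is also assumed; a cost-minimising formulation would replace the maximum with a minimum.

$$
Q^*(s, a) = \mathbb{E}\left[\, r + \max_{a'} Q^*(s', a') \,\middle|\, s, a \right] \quad \text{for } s \notin \mathcal{T},
\qquad
Q^*(s, a) = 0 \quad \text{for } s \in \mathcal{T}.
$$

Because the problem is undiscounted, no discount factor appears; the fixed value at the terminal states is what anchors the recursion.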