We cast episodic Markov decision process (MDP) planning as Bayesian inference over _policies_. A policy is treated as the latent variable and is assigned an unnormalized probability of optimality that is monotone in its expected return, yielding a posterior distribution whose modes coincide with return-maximizing solutions while posterior dispersion represents uncertainty over optimal behavior. To approximate this posterior in discrete domains, we adapt variational sequential Monte Carlo (VSMC) to inference over deterministic policies under stochastic dynamics, introducing a sweep that enforces policy consistency across revisited states and couples transition randomness across particles to avoid confounding from simulator noise. Acting is performed by posterior predictive sampling, which induces a stochastic control policy in the manner of Thompson sampling rather than through entropy regularization. Across grid worlds, Blackjack, Triangle Tireworld, and Academic Advising, we analyze the structure of inferred policy distributions and compare the resulting behavior to discrete Soft Actor-Critic, highlighting qualitative and statistical differences that arise from policy-level uncertainty.
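To make the posterior construction concrete, one conventional formalization (our notation; the binary optimality variable $O$ and the temperature $\eta$ are assumptions following the standard control-as-inference setup, not symbols fixed by the abstract) is:

```latex
% Posterior over policies \pi, with an optimality likelihood monotone in the
% expected episodic return J(\pi); \eta > 0 is an assumed temperature.
p(\pi \mid O = 1) \;\propto\; p(\pi)\,\exp\bigl(\eta\, J(\pi)\bigr),
\qquad
J(\pi) \;=\; \mathbb{E}\!\left[\,\sum_{t=0}^{T-1} r_t \;\middle|\; \pi\right].
```

Under a uniform prior $p(\pi)$, the modes of this posterior are exactly the return-maximizing policies, and its dispersion widens whenever several policies achieve near-maximal return, which is the policy-level uncertainty the abstract refers to.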
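The sketch below illustrates the two mechanics named above on a toy problem: deterministic tabular policies (so revisited states always receive the same action) evaluated under common random numbers, so that return differences across particles reflect the policies rather than lucky transitions. It is a minimal, self-contained assumption-laden illustration, not the paper's implementation: `step`, `rollout_return`, the toy chain environment, and `eta` are all hypothetical, and the single softmax resampling step stands in for a full VSMC sweep.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_particles, horizon = 8, 3, 64, 20
eta = 2.0  # assumed temperature: larger eta concentrates mass on high-return policies

# Each particle is a deterministic tabular policy: one action per state.
particles = rng.integers(n_actions, size=(n_particles, n_states))

def step(s, a, u):
    """Toy stochastic chain (illustrative stand-in for the simulator):
    action 1 moves right with prob 0.8 (driven by shared noise u), else stay;
    reward 1 for being in the terminal-most state."""
    s_next = min(s + 1, n_states - 1) if (a == 1 and u < 0.8) else s
    return s_next, float(s_next == n_states - 1)

def rollout_return(policy, noise):
    """Return of `policy` under a fixed sequence of transition-noise draws.
    Sharing `noise` across particles couples the simulator randomness."""
    s, total = 0, 0.0
    for t in range(horizon):
        a = policy[s]                # deterministic: consistent across revisits
        s, r = step(s, a, noise[t])  # transition driven by the shared noise draw
        total += r
    return total

# One sweep: evaluate all particles under common random numbers, then
# resample with weights monotone in return (softmax of eta * return).
noise = rng.random(horizon)  # transition noise shared by every particle
returns = np.array([rollout_return(p, noise) for p in particles])
w = np.exp(eta * (returns - returns.max()))
w /= w.sum()
particles = particles[rng.choice(n_particles, size=n_particles, p=w)]

# Acting by posterior predictive sampling (Thompson-style): draw one policy
# from the particle posterior and follow it for the episode.
policy = particles[rng.integers(n_particles)]
```

Coupling the noise across particles is the classical common-random-numbers device: every policy is scored against the same realization of the simulator's randomness, so the resampling weights compare policies rather than noise realizations.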