Multi-Environment MDPs with Prior and Universal Semantics

Multiple-environment Markov decision processes (MEMDPs) equip an MDP with several probabilistic transition functions, one per possible environment, so that the state is observable but the environment is not. Previous work studies two semantics: (i) the universal semantics, where an adversary picks the environment; and (ii) the prior semantics, where the environment is drawn once before execution from a fixed distribution. We clarify the relation between these semantics. For parity objectives, we show that the qualitative questions (whether the value is one) coincide under the two semantics, and we develop a new algorithm for the general value of MEMDPs under the prior semantics. In particular, we show that the prior value of an MEMDP with a parity objective can be approximated to any precision by a space-efficient algorithm; equivalently, the associated gap problem is decidable in PSPACE when probabilities are given in unary, and in EXPSPACE otherwise. We then prove that the universal value equals the infimum of the prior values over all beliefs. This yields a new algorithm for the universal gap problem with the same complexity (PSPACE for unary probabilities, EXPSPACE in general), improving on earlier doubly-exponential-space procedures. Finally, we observe that MEMDPs under the prior semantics form an important tractable subclass of POMDPs: our algorithms exploit the fact that belief entropy never increases, and we establish that any POMDP with this property reduces effectively to a prior-MEMDP, showing that prior-MEMDPs capture a broad and practically relevant subclass of POMDPs.
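
To make the notion of a belief under the prior semantics concrete, the following is a minimal sketch of a Bayesian belief update over environments after one observed transition. It is an illustration only, not the algorithm of the paper: the dictionary encoding, the environment names `e1` and `e2`, the states `s0` and `s1`, the action `a`, and the function `update_belief` are all hypothetical choices made for this example.

```python
from fractions import Fraction

# Hypothetical encoding (an assumption of this sketch):
# TRANSITIONS[env][(state, action)] maps next_state -> probability.
TRANSITIONS = {
    "e1": {("s0", "a"): {"s0": Fraction(1, 4), "s1": Fraction(3, 4)}},
    "e2": {("s0", "a"): {"s0": Fraction(3, 4), "s1": Fraction(1, 4)}},
}


def update_belief(belief, transitions, state, action, next_state):
    """Bayes update of a belief over environments.

    The state is observable, so after seeing state --action--> next_state
    each environment is reweighted by the probability it assigns to that step.
    """
    weighted = {}
    for env, prob in belief.items():
        step = transitions[env].get((state, action), {})
        weighted[env] = prob * step.get(next_state, Fraction(0))
    total = sum(weighted.values())
    if total == 0:
        raise ValueError("observed transition has probability zero under this belief")
    # Normalise so that the posterior is again a distribution.
    return {env: weight / total for env, weight in weighted.items()}


# The environment is drawn once from the prior; here a uniform prior.
prior = {"e1": Fraction(1, 2), "e2": Fraction(1, 2)}
posterior = update_belief(prior, TRANSITIONS, "s0", "a", "s1")
print(posterior)
```

In this toy instance, observing the move from `s0` to `s1` shifts the belief from the uniform prior to 3/4 on `e1` and 1/4 on `e2`, since `e1` makes that transition three times as likely. Exact rationals are used so that the posterior can be checked by hand.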
