Multi-agent planning and reinforcement learning are challenging when agents can neither observe the state of the world nor communicate with each other because of communication costs, latency, or noise. Partially Observable Stochastic Games (POSGs) provide a mathematical framework for modelling such scenarios. This paper aims to improve the efficiency of planning and reinforcement learning algorithms for POSGs by identifying the underlying structure of optimal state-value functions. The approach reformulates the original game from the perspective of a trusted third party who plans on behalf of all agents simultaneously. From this viewpoint, the original POSG can be viewed as a Markov game whose states are occupancy states, i.e., posterior probability distributions over the hidden states of the world and the stream of actions and observations that agents have experienced so far. The main result is a proof that, in all zero-sum, common-payoff, and Stackelberg POSGs, the optimal state-value function is a convex function of occupancy states expressed on an appropriate basis.