We study a sequential decision-making problem between a principal and an agent with incomplete information on both sides. In this model, the principal and the agent interact in a stochastic environment, and each is privy to observations about the state that are not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The principal and the agent communicate their signals to each other and then select their actions independently based on this communication. Each player receives a payoff based on the state and the joint action, and the environment moves to a new state. The interaction continues over a finite time horizon, and both players act to optimize their own total payoffs over the horizon. Our model encompasses as special cases stochastic games of incomplete information and POMDPs, as well as sequential Bayesian persuasion and mechanism design problems. We study both the computation of optimal policies and learning in our setting. Although the general problems are computationally intractable, we study algorithmic solutions under a conditional independence assumption on the underlying state-observation distributions. We present a polynomial-time algorithm that computes the principal's optimal policy up to an additive approximation error. Additionally, we give an efficient learning algorithm for the case where the transition probabilities are not known beforehand; the algorithm guarantees sublinear regret for both players.
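The round structure described in the abstract (private observations, two-way communication, independent actions, payoffs, state transition) can be illustrated with a toy simulation. This is only a sketch under loudly stated assumptions, not the paper's algorithm: the function names `observe` and `simulate`, the noise model, the "guess the state" payoffs, the uniform transitions, and the truthful report and full-revelation signal are all hypothetical placeholders chosen for illustration.

```python
import random


def observe(state, n_states, accuracy, rng):
    """Hypothetical noise model: return the true state with probability
    `accuracy`, otherwise a uniformly random state."""
    if rng.random() < accuracy:
        return state
    return rng.randrange(n_states)


def simulate(horizon=5, n_states=3, accuracy=0.8, seed=0):
    """Run one finite-horizon interaction and return (totals, trace)."""
    rng = random.Random(seed)
    state = rng.randrange(n_states)
    totals = {"principal": 0.0, "agent": 0.0}
    trace = []
    for t in range(horizon):
        # Each player privately observes the state; the two draws are
        # independent given the state (an illustrative assumption).
        obs_p = observe(state, n_states, accuracy, rng)
        obs_a = observe(state, n_states, accuracy, rng)

        # Communication. Assumption: the agent reports truthfully and the
        # principal's committed signaling scheme is full revelation.
        report = obs_a
        signal = obs_p

        # Actions are chosen independently, based on the communication.
        # Placeholder rule: each player guesses the state from the
        # message received from the other player.
        act_p = report
        act_a = signal

        # Placeholder payoffs: 1 for matching the true state, else 0.
        totals["principal"] += float(act_p == state)
        totals["agent"] += float(act_a == state)
        trace.append((t, state, obs_p, obs_a, act_p, act_a))

        # Placeholder dynamics: the next state is drawn uniformly.
        state = rng.randrange(n_states)
    return totals, trace


if __name__ == "__main__":
    totals, trace = simulate(horizon=5, n_states=3, seed=1)
    print(totals)
    for step in trace:
        print(step)
```

In the setting the abstract describes, the fixed `report` and `signal` rules would be replaced by the principal's committed policy and the agent's best response, and the payoffs and transitions would depend on the joint action; the loop above only fixes the order of events within a round.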