We study a sequential decision-making problem between a principal and an agent with incomplete information on both sides. In this model, the principal and the agent interact in a stochastic environment, and each is privy to observations about the state that are not available to the other. The principal has the power of commitment, both to elicit information from the agent and to provide signals about her own information. The principal and the agent communicate their signals to each other and select their actions independently based on this communication. Each player receives a payoff based on the state and their joint actions, and the environment moves to a new state. The interaction continues over a finite time horizon, and both players act to optimize their own total payoffs over the horizon. Our model encompasses as special cases stochastic games of incomplete information and POMDPs, as well as sequential Bayesian persuasion and mechanism design problems. We study both computation of optimal policies and learning in our setting. While the general problems are computationally intractable, we study algorithmic solutions under a conditional independence assumption on the underlying state-observation distributions. We present a polynomial-time algorithm to compute the principal's optimal policy up to an additive approximation. Additionally, we give an efficient learning algorithm for the case where the transition probabilities are not known beforehand. The algorithm guarantees sublinear regret for both players.