Persuading Farsighted Receivers in MDPs: the Power of Honesty

Bayesian persuasion studies the problem faced by an informed sender who strategically discloses information to influence the behavior of an uninformed receiver. Recently, growing attention has been devoted to settings where the sender and the receiver interact sequentially, in which the receiver's decision-making problem is usually modeled as a Markov decision process (MDP). However, previous works have focused on computing optimal information-revelation policies (a.k.a. signaling schemes) under the restrictive assumption that the receiver acts myopically, selecting actions to maximize the one-step utility and disregarding future rewards. This assumption is justified by the fact that, when the receiver is farsighted and thus considers future rewards, finding an optimal Markovian signaling scheme is NP-hard. In this paper, we show that Markovian signaling schemes do not constitute the "right" class of policies. Indeed, unlike in most MDP settings, we prove that Markovian signaling schemes are not optimal, and general history-dependent signaling schemes should be considered. Moreover, we show that history-dependent signaling schemes circumvent the negative complexity results affecting Markovian signaling schemes. Formally, we design an algorithm that computes an optimal and $\epsilon$-persuasive history-dependent signaling scheme in time polynomial in $1/\epsilon$ and in the instance size. The crucial challenge is that general history-dependent signaling schemes cannot be represented in polynomial space. Nevertheless, we introduce a convenient subclass of history-dependent signaling schemes, called promise-form, which are as powerful as general history-dependent ones and efficiently representable. Intuitively, promise-form signaling schemes compactly encode histories in the form of honest promises on the receiver's future rewards.
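
The distinction between a myopic and a farsighted receiver can be made concrete with a small sketch. The toy MDP below is purely hypothetical (its states, actions, rewards, and the helper names are assumptions, not taken from the paper), and it involves no sender or signaling scheme: it only contrasts one-step reward maximization with finite-horizon planning by backward induction.

```python
# Hypothetical two-state, deterministic toy MDP (not from the paper).
# REWARDS[s][a]: receiver's immediate reward; NEXT_STATE[s][a]: successor state.
REWARDS = {
    0: {"safe": 1.0, "invest": 0.0},
    1: {"safe": 3.0, "invest": 3.0},
}
NEXT_STATE = {
    0: {"safe": 0, "invest": 1},
    1: {"safe": 1, "invest": 1},
}


def myopic_action(state):
    """Myopic receiver: maximize the one-step reward, ignoring the future."""
    return max(REWARDS[state], key=REWARDS[state].get)


def farsighted_plan(horizon):
    """Farsighted receiver: backward induction over a finite horizon.

    Returns (policy, value), where policy[t][s] is the action chosen at
    step t in state s and value[s] is the optimal total reward from s.
    """
    value = {s: 0.0 for s in REWARDS}  # nothing left to earn after the last step
    policy = []
    for _ in range(horizon):
        q = {
            s: {a: REWARDS[s][a] + value[NEXT_STATE[s][a]] for a in REWARDS[s]}
            for s in REWARDS
        }
        step_policy = {s: max(q[s], key=q[s].get) for s in REWARDS}
        value = {s: q[s][step_policy[s]] for s in REWARDS}
        policy.append(step_policy)
    policy.reverse()  # built from the last step backwards
    return policy, value


def rollout(choose, horizon, start=0):
    """Total reward collected when actions are picked by choose(t, state)."""
    state, total = start, 0.0
    for t in range(horizon):
        action = choose(t, state)
        total += REWARDS[state][action]
        state = NEXT_STATE[state][action]
    return total


if __name__ == "__main__":
    horizon = 3
    plan, _ = farsighted_plan(horizon)
    print("myopic total:", rollout(lambda t, s: myopic_action(s), horizon))
    print("farsighted total:", rollout(lambda t, s: plan[t][s], horizon))
```

In this toy instance the myopic receiver keeps choosing "safe" in state 0, while the farsighted one gives up the immediate reward to reach state 1. A sender facing the second kind of receiver has to account for such multi-step reasoning, which is the setting the abstract addresses.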
