In Bayesian persuasion, an informed sender strategically discloses information to a receiver so as to persuade them to undertake desirable actions. Recently, growing attention has been devoted to settings in which sender and receivers interact sequentially. In particular, Markov persuasion processes (MPPs) have been introduced to capture sequential scenarios where a sender faces a stream of myopic receivers in a Markovian environment. The MPPs studied so far in the literature suffer from issues that prevent them from being fully operational in practice, e.g., they assume that the sender knows receivers' rewards. We fix such issues by addressing MPPs where the sender has no knowledge about the environment. We design a learning algorithm for the sender, working with partial feedback. We prove that its regret with respect to an optimal information-disclosure policy grows sublinearly in the number of episodes, as is the case for the loss in persuasiveness accumulated while learning. Moreover, we provide a lower bound for our setting matching the guarantees of our algorithm.