In Bayesian persuasion, an informed sender strategically discloses information to a receiver so as to persuade them to undertake desirable actions. Recently, growing attention has been devoted to settings in which the sender and receivers interact sequentially. In particular, Markov persuasion processes (MPPs) have been introduced to capture sequential scenarios where a sender faces a stream of myopic receivers in a Markovian environment. The MPPs studied so far in the literature suffer from issues that prevent them from being fully operational in practice, e.g., they assume that the sender knows the receivers' rewards. We fix such issues by addressing MPPs where the sender has no knowledge of the environment. We design a learning algorithm for the sender that works with partial feedback. We prove that its regret with respect to an optimal information-disclosure policy grows sublinearly in the number of episodes, as does the loss in persuasiveness accumulated while learning. Moreover, we provide a lower bound for our setting that matches the guarantees of our algorithm.