In this paper, we extend the Bayesian persuasion framework to account for unobserved confounding variables in sender-receiver interactions. Traditional models assume that belief updates follow Bayesian principles, but real-world scenarios often involve hidden variables that affect the receiver's belief formation and decision-making. We model this as a sequential decision-making problem in which the sender and receiver interact over multiple rounds. In each round, the sender communicates with the receiver, who also interacts with the environment. Crucially, the receiver's belief update is affected by an unobserved confounding variable. By reformulating this scenario as a Partially Observable Markov Decision Process (POMDP), we capture the sender's incomplete information about both the dynamics of the receiver's beliefs and the unobserved confounder. We prove that finding an optimal observation-based policy in this POMDP is equivalent to finding an optimal signaling strategy in the original persuasion framework. Furthermore, we show that this reformulation enables the application of proximal learning to off-policy evaluation in the persuasion process. As a result, the sender can evaluate alternative signaling strategies using only observational data collected under a behavior policy, avoiding the need for costly new experiments.