We introduce and study the online Bayesian recommendation problem for a recommender system platform. At each round, the platform privately observes a utility-relevant \emph{state} of a product and uses this information to make online recommendations to a stream of myopic users. This paradigm is common across the current Internet economy. The platform commits to an online recommendation policy that exploits her information advantage on the product state to persuade self-interested users to follow the recommendation. Since the platform does not know users' preferences or beliefs in advance, we study the platform's online learning problem: designing an adaptive recommendation policy that persuades users while gradually learning their preferences and beliefs en route. Specifically, we aim to design online learning policies with no \emph{Stackelberg regret} for the platform, i.e., no regret against the optimal benchmark policy in hindsight, under the assumption that users adapt their responses to the benchmark policy. Our first result is an online policy whose regret has double-logarithmic dependence on the number of rounds. We also present an information-theoretic lower bound showing that no adaptive online policy can achieve regret with a better dependence on the number of rounds. Finally, by formulating the platform's problem as optimizing a linear program with membership oracle access, we present a second online recommendation policy whose regret has polynomial dependence on the number of states but logarithmic dependence on the number of rounds.
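To make the benchmark policy concrete, the following is a minimal sketch of the offline problem for a single, fully known user. It rests on simplifying assumptions that are not stated in the abstract: the user takes a binary action (follow the recommendation or not), the platform wants to maximize the probability that the user follows, and the prior over states and the user's per-state gain from following are known. Under these assumptions the linear program reduces to a fractional knapsack. The names `optimal_recommendation`, `prior`, and `gain` are hypothetical and chosen for illustration only.

```python
from typing import List, Sequence


def optimal_recommendation(prior: Sequence[float], gain: Sequence[float]) -> List[float]:
    """Return x, where x[i] is the probability of recommending in state i.

    Illustrative assumptions (not taken from the abstract):
      * prior[i] is the probability of state i;
      * gain[i] is the user's utility from following minus not following in state i;
      * the platform maximizes sum_i prior[i] * x[i]
        subject to the persuasiveness constraint sum_i prior[i] * gain[i] * x[i] >= 0.
    """
    n = len(prior)
    x = [0.0] * n

    # Always recommend in states where following benefits the user;
    # these states build up the "persuasion budget".
    budget = 0.0
    for i in range(n):
        if gain[i] >= 0:
            x[i] = 1.0
            budget += prior[i] * gain[i]

    # Spend the budget on harmful states, least harmful first
    # (fractional knapsack: best ratio of probability mass to cost).
    harmful = sorted((i for i in range(n) if gain[i] < 0), key=lambda i: -gain[i])
    for i in harmful:
        cost = prior[i] * (-gain[i])
        if cost == 0.0:
            x[i] = 1.0  # zero-probability state: recommending is free
            continue
        fraction = min(1.0, budget / cost)
        x[i] = fraction
        budget -= fraction * cost
        if budget <= 0.0:
            break
    return x


if __name__ == "__main__":
    # Two equally likely states: following gains 1 in the first, loses 2 in the second.
    print(optimal_recommendation([0.5, 0.5], [1.0, -2.0]))
```

In the example, the platform always recommends in the good state and recommends half the time in the bad state, which leaves the user exactly indifferent on receiving a recommendation. In the online problem described above, the platform does not know `gain` and has to learn it from users' responses.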