This paper studies the problem of learning interactive recommender systems from logged feedback without any online exploration. We address the problem by proposing a general offline reinforcement learning framework for recommendation, which maximizes cumulative user rewards using logged data alone. Specifically, we first introduce a probabilistic generative model for interactive recommendation, and then propose an effective inference algorithm for learning discrete, stochastic policies from logged feedback. To make offline learning more effective, we propose five approaches to minimize the distribution mismatch between the logging policy and the recommendation policy: support constraints, supervised regularization, policy constraints, dual constraints, and reward extrapolation. We conduct extensive experiments on two public real-world datasets, showing that the proposed methods outperform existing supervised learning and reinforcement learning methods for recommendation.
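To make the idea of limiting the mismatch between the logging policy and the recommendation policy concrete, the following is a minimal sketch of one generic form of policy constraint: a KL penalty toward the logging policy over a discrete item set. It is not the algorithm of this paper. The function name `kl_constrained_policy`, the penalty weight `alpha`, and the toy numbers are all assumptions made for illustration. The sketch uses the standard closed-form result that maximizing expected value minus `alpha` times KL(pi || beta) yields pi(a) proportional to beta(a) * exp(Q(a) / alpha).

```python
import numpy as np


def kl_constrained_policy(q_values, logging_probs, alpha=1.0):
    """Return the policy maximizing E_pi[Q] - alpha * KL(pi || beta).

    Hypothetical helper for illustration only, not the paper's method.
    q_values:      estimated value of recommending each item (assumed given).
    logging_probs: estimated logging-policy probability of each item.
    alpha:         penalty weight; larger values keep pi closer to beta.
    """
    q = np.asarray(q_values, dtype=float)
    beta = np.asarray(logging_probs, dtype=float)

    # Items the logging policy never recommended get probability zero,
    # so the learned policy stays inside the logged support.
    in_support = beta > 0
    logits = np.full_like(q, -np.inf)
    logits[in_support] = np.log(beta[in_support]) + q[in_support] / alpha

    # Subtract the maximum finite logit for numerical stability.
    logits -= logits[in_support].max()
    weights = np.exp(logits)
    return weights / weights.sum()


# Toy example: item 2 has the highest estimated value but was never logged.
q_toy = [1.0, 2.0, 3.0]
beta_toy = [0.5, 0.5, 0.0]
pi_toy = kl_constrained_policy(q_toy, beta_toy, alpha=1.0)
```

In this toy case the unlogged item keeps zero probability despite its high estimated value, and increasing `alpha` pulls the resulting policy back toward the logging policy.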