Recommender systems predict which items a user will interact with next, based on their past interactions. The problem is often approached through supervised learning, but recent work has shifted toward policy optimization of rewards (e.g., user engagement). One challenge with the latter is policy mismatch: we can only train a new policy on data collected by a previously deployed policy. The conventional way to address this problem is importance sampling correction, but this comes with practical limitations. We suggest an alternative approach: local policy improvement without off-policy correction. Our method computes and optimizes a lower bound on the expected reward of the target policy; the bound is easy to estimate from data and does not involve density ratios (such as those appearing in importance sampling correction). This local policy improvement paradigm is well suited to recommender systems, as previous policies are typically of decent quality and policies are updated frequently. We provide empirical evidence and practical recipes for applying our technique in a sequential recommendation setting.
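To make the contrast with importance sampling concrete, the following is a minimal sketch on a toy one-step problem. It does not reproduce the bound derived in this work; it uses a standard textbook inequality, x >= 1 + log x, to build one example of a lower bound whose optimization needs no density ratio. All names (`mu`, `pi`, `rewards`, `local_lower_bound`) and the toy data are assumptions made for illustration.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
n_items = 5

# Hypothetical toy setup: non-negative rewards, a logging policy mu,
# and a target policy pi that stays close to mu.
rewards = rng.uniform(0.0, 1.0, size=n_items)
mu = softmax(rng.normal(size=n_items))
pi = softmax(np.log(mu) + 0.3 * rng.normal(size=n_items))

def true_value(policy):
    """Exact expected reward of a policy in this toy problem."""
    return float(policy @ rewards)

def local_lower_bound(pi, mu, rewards):
    """Lower bound on true_value(pi) for non-negative rewards.

    Uses pi/mu >= 1 + log(pi/mu), so
    E_mu[(pi/mu) * r] >= E_mu[r * (1 + log pi - log mu)].
    The bound is tight when pi == mu.
    """
    return float(np.sum(mu * rewards * (1.0 + np.log(pi / mu))))

# Logged data: items sampled from the logging policy mu.
actions = rng.choice(n_items, size=10_000, p=mu)
logged_rewards = rewards[actions]

# Importance sampling estimate: needs the density ratio pi/mu per sample.
is_estimate = float(np.mean(pi[actions] / mu[actions] * logged_rewards))

# The only pi-dependent part of the bound is a reward-weighted
# log-likelihood, which can be estimated without any density ratio.
surrogate = float(np.mean(logged_rewards * np.log(pi[actions])))

print("true value of pi:     ", true_value(pi))
print("importance sampling:  ", is_estimate)
print("lower bound:          ", local_lower_bound(pi, mu, rewards))
print("ratio-free surrogate: ", surrogate)
```

In this sketch the terms involving `mu` are constants with respect to `pi`, so maximizing the bound over `pi` reduces to maximizing `surrogate`. The bound is only informative when `pi` stays near `mu`, which is the sense in which the improvement is local.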