Offline reinforcement learning (RL), which learns a policy from logged data without interacting with an online environment, has become a popular choice for decision-making tasks such as interactive recommendation. Offline RL suffers from the value overestimation problem. To address it, existing methods employ conservatism, e.g., by constraining the learned policy to stay close to the behavior policy or by penalizing rarely visited state-action pairs. However, applying such offline RL methods to recommendation causes a severe Matthew effect, i.e., the rich get richer and the poor get poorer: popular items or categories are promoted while less popular ones are suppressed. This is a notorious issue that needs to be addressed in practical recommender systems. In this paper, we aim to alleviate the Matthew effect in offline RL-based recommendation. Through theoretical analysis, we find that the conservatism of existing methods fails to pursue users' long-term satisfaction. This inspires us to add a penalty term that relaxes the pessimism on states where the logging policy has high entropy and indirectly penalizes actions leading to less diverse states. This leads to the main technical contribution of the work: the Debiased model-based Offline RL (DORL) method. Experiments show that DORL not only captures user interests well but also alleviates the Matthew effect. The implementation is available at https://github.com/chongminggao/DORL-codes.
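To make the entropy-based term concrete, the following is a minimal toy sketch and not the DORL implementation. It assumes discrete, hashable states and actions, estimates the logging policy's entropy per state from logged (state, action) pairs, and adds that entropy as a bonus to a pessimistic reward. The function names, the additive form of the reward, and the weights `lam_u` and `lam_e` are hypothetical choices made for illustration; the exact formulation used by DORL may differ.

```python
import math
from collections import Counter, defaultdict


def logging_policy_entropy(logged_pairs):
    """Empirical entropy (in nats) of the logged action distribution per state."""
    counts = defaultdict(Counter)
    for state, action in logged_pairs:
        counts[state][action] += 1

    entropy = {}
    for state, action_counts in counts.items():
        total = sum(action_counts.values())
        probs = [c / total for c in action_counts.values()]
        # H(s) = sum_a p(a|s) * log(1 / p(a|s))
        entropy[state] = sum(p * math.log(1.0 / p) for p in probs)
    return entropy


def shaped_reward(predicted_reward, uncertainty, state_entropy,
                  lam_u=1.0, lam_e=1.0):
    """Hypothetical shaped reward: subtract an uncertainty penalty (conservatism)
    and add an entropy bonus that relaxes pessimism on diverse states."""
    return predicted_reward - lam_u * uncertainty + lam_e * state_entropy


# Toy log: in "s_pop" the logging policy always recommended the same item,
# while in "s_div" it spread recommendations evenly over four items.
logged = [("s_pop", "a1")] * 4 + [
    ("s_div", "a1"), ("s_div", "a2"), ("s_div", "a3"), ("s_div", "a4"),
]
H = logging_policy_entropy(logged)

# Same predicted reward and uncertainty, different logging-policy entropy.
r_pop = shaped_reward(1.0, 0.5, H["s_pop"])
r_div = shaped_reward(1.0, 0.5, H["s_div"])
```

In this toy log, the state with evenly spread logged actions receives the larger shaped reward, so a policy trained on such rewards is nudged toward states where the logged behavior was diverse rather than toward states dominated by a single popular item.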