Rewards serve as a measure of user satisfaction and act as a limiting factor in interactive recommender systems. In this work, we focus on the problem of learning to reward (LTR), which is fundamental to reinforcement learning. Previous approaches either introduce additional procedures for learning the reward, thereby increasing the complexity of optimization, or assume that user-agent interactions provide perfect demonstrations, which is not feasible in practice. Ideally, we would employ a unified approach that optimizes both the reward and the policy using compositional demonstrations. This requirement is challenging, however, because rewards inherently quantify user feedback on-policy, whereas recommender agents approximate the future cumulative valuation off-policy. To tackle this challenge, we propose a novel batch inverse reinforcement learning paradigm that achieves the desired properties. Our method uses discounted stationary distribution correction to combine LTR with recommender agent evaluation. To fulfill the compositional requirement, we incorporate the concept of pessimism through conservatism. Specifically, we modify the vanilla correction using a Bellman transformation and enforce KL regularization to constrain consecutive policy updates. We conduct empirical studies on two real-world datasets that represent two kinds of compositional coverage. The results show that the proposed method yields relative improvements in both effectiveness (2.3%) and efficiency (11.53%).
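To make the idea of constraining consecutive policy updates with KL regularization concrete, the following is a minimal sketch rather than the method proposed here. It assumes a single state with a discrete policy over a handful of items; the names `kl_divergence`, `regularized_objective`, and the weight `beta` are hypothetical and chosen only for illustration.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same items.

    eps is a small constant (an assumption of this sketch) that avoids
    taking the log of zero when q assigns no mass to an item.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def regularized_objective(new_policy, old_policy, rewards, beta=0.1):
    """Expected reward of new_policy minus a KL penalty toward old_policy.

    A larger beta keeps the updated policy closer to the previous one.
    """
    expected_reward = sum(p * r for p, r in zip(new_policy, rewards))
    return expected_reward - beta * kl_divergence(new_policy, old_policy)


if __name__ == "__main__":
    old = [0.9, 0.1]          # previous policy over two items
    new = [0.5, 0.5]          # candidate updated policy
    rewards = [0.2, 1.0]      # illustrative per-item rewards
    print(kl_divergence(new, old))
    print(regularized_objective(new, old, rewards, beta=0.1))
```

In this toy setting, a candidate policy that moves far from the previous one pays a larger penalty, so the update is accepted only if the gain in expected reward outweighs the weighted divergence.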