Interactive Recommender Systems (IRSs) have attracted considerable attention because they can model the interactive process between users and recommender systems. Many approaches adopt Reinforcement Learning (RL) algorithms, since these can directly maximize users' cumulative rewards. In IRS research, publicly available review datasets are commonly used to compare and evaluate algorithms. However, the user feedback in these public datasets consists only of instant responses (e.g., a rating) and contains no delayed responses (e.g., dwell time or lifetime value). The question therefore remains whether review datasets are an appropriate choice for evaluating the long-term effects of an IRS. In this work, we revisit IRS experiments on review datasets and compare RL-based models with a simple reward model that greedily recommends the item with the highest one-step reward. Our extensive analysis yields three main findings. First, the simple greedy reward model consistently outperforms RL-based models in maximizing cumulative rewards. Second, applying higher weighting to long-term rewards degrades recommendation performance. Third, user feedback has only negligible long-term effects in the benchmark datasets. Based on these findings, we conclude that a dataset must be carefully verified and that a simple greedy baseline should be included for a proper evaluation of RL-based IRS approaches.
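To make the contrast between a greedy one-step baseline and a long-term objective concrete, the following is a minimal sketch and not the authors' implementation. The function names, the static table of predicted rewards, and the rule that an item is recommended at most once per episode are all assumptions introduced for illustration. The discount factor `gamma` stands in for the "weighting to long-term rewards" mentioned above: with `gamma = 0` only the immediate reward counts.

```python
from typing import Dict, List, Set


def greedy_recommend(predicted_reward: Dict[str, float], seen: Set[str]) -> str:
    """Return the unseen item with the highest predicted one-step reward.

    `predicted_reward` is a hypothetical lookup from item id to the reward
    a learned reward model would predict for the current user.
    """
    candidates = {item: r for item, r in predicted_reward.items() if item not in seen}
    if not candidates:
        raise ValueError("no unseen items left to recommend")
    return max(candidates, key=candidates.get)


def run_greedy_episode(predicted_reward: Dict[str, float], steps: int) -> List[str]:
    """Recommend `steps` items greedily, never repeating an item (assumption)."""
    seen: Set[str] = set()
    order: List[str] = []
    for _ in range(steps):
        item = greedy_recommend(predicted_reward, seen)
        seen.add(item)
        order.append(item)
    return order


def discounted_return(rewards: List[float], gamma: float) -> float:
    """Sum of gamma**t * r_t over the episode; gamma = 0 keeps only r_0."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))
```

For example, with predicted rewards `{"a": 4.5, "b": 3.0, "c": 5.0}`, a two-step greedy episode recommends `c` and then `a`. An RL-based model would instead pick actions to maximize `discounted_return` with `gamma > 0`, which can only help if earlier recommendations actually change later rewards.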