We consider the problem of sequential recommendation, in which each recommendation is made based on past interactions. The task requires efficient processing of sequential data and aims to provide recommendations that maximize long-term reward. To this end, we train a farsighted recommender with an offline RL algorithm, using a policy network initialized from a pre-trained transformer model. The pre-trained model leverages the transformer's strength in processing sequential information. In contrast to prior works that rely on online interaction via simulation, we implement a fully offline RL framework that converges quickly and stably. Through extensive experiments on public datasets, we show that our method is robust across various recommendation regimes, including e-commerce and movie suggestions. Compared to state-of-the-art supervised learning algorithms, our algorithm yields higher-quality recommendations, demonstrating the advantage of combining RL and transformers.