We introduce a new sequential transformer reinforcement learning architecture, RLT4Rec, and demonstrate that it achieves excellent performance on a range of item recommendation tasks. RLT4Rec uses a relatively simple transformer architecture that takes the user's (item, rating) history as input and outputs the next item to present to the user. Unlike existing RL approaches, RLT4Rec requires no state observation or estimate as input. It handles new and established users within the same consistent framework and automatically balances the "exploration" needed to discover a new user's preferences with the "exploitation" that is more appropriate for established users. Training of RLT4Rec is robust, fast, and insensitive to the choice of training data: the model learns to generate "good" personalised sequences that users tend to rate highly even when trained on "bad" data.