Reinforcement learning-based recommender systems (RLRS) have shown promise across a range of applications, from e-commerce platforms to streaming services. Yet they face two notable challenges: designing reward functions and exploiting large pre-existing datasets within the RL framework. Recent advances in offline RLRS address both challenges. However, existing methods rely mainly on the transformer architecture, whose computational and training costs grow as sequence lengths increase. In addition, the prevalent methods use fixed-length input trajectories, which limits their capacity to capture evolving user preferences. In this study, we introduce a new offline RLRS method that addresses these problems. We recast the RLRS problem by modeling sequential decision-making as an inference task with adaptive masking configurations. The adaptive approach selectively masks input tokens, turning recommendation into inference over varying token subsets and thereby improving the agent's ability to infer across diverse trajectory lengths. Furthermore, we incorporate a multi-scale segmented retention mechanism that models long sequences efficiently and significantly improves computational efficiency. Our experimental analysis, conducted on both an online simulator and offline datasets, demonstrates the advantages of the proposed method.
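The adaptive masking idea can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the masking scheme, so the function name `adaptive_mask`, the `MASK` placeholder, the (state, action, reward) token layout, and the choice to keep a randomly sized recent window are all assumptions made for illustration.

```python
import random

MASK = "<mask>"  # hypothetical placeholder for a hidden token


def adaptive_mask(trajectory, min_len=1, rng=None):
    """Mask a trajectory of (state, action, reward) steps.

    Assumed scheme (not taken from the paper): keep a randomly sized window
    of the most recent steps visible, mask everything before it, and hide
    the final action and reward so the final action is the inference target.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one step")
    rng = rng or random.Random()
    n = len(trajectory)
    # Sample how many recent steps stay visible (inclusive bounds).
    keep = rng.randint(min(max(min_len, 1), n), n)
    start = n - keep

    masked = []
    for i, (state, action, reward) in enumerate(trajectory):
        if i < start:
            # Outside the sampled window: hide the whole step.
            masked.append((MASK, MASK, MASK))
        elif i == n - 1:
            # Last step: the state is visible, the action must be inferred.
            masked.append((state, MASK, MASK))
        else:
            masked.append((state, action, reward))

    target = trajectory[-1][1]  # the action the model should infer
    return masked, target, keep


if __name__ == "__main__":
    traj = [("s0", "a0", 0.0), ("s1", "a1", 1.0),
            ("s2", "a2", 0.0), ("s3", "a3", 1.0)]
    masked, target, keep = adaptive_mask(traj, rng=random.Random(0))
    print(keep, target)
    for step in masked:
        print(step)
```

Because the visible window length is resampled on every call, a model trained on such inputs sees the same inference task posed over many context lengths, which is the property the abstract attributes to adaptive masking.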