Cascading Reinforcement Learning

Cascading bandits have gained popularity in recent years due to their applicability to recommendation systems and online advertising. In the cascading bandit model, at each timestep an agent recommends an ordered subset of items (called an item list) from a pool of items, each associated with an unknown attraction probability. The user then examines the list and clicks the first attractive item (if any), after which the agent receives a reward. The goal of the agent is to maximize the expected cumulative reward. However, the prior literature on cascading bandits ignores the influence of user states (e.g., historical behaviors) on recommendations and the change of states as the session proceeds. Motivated by this gap, we propose a generalized cascading RL framework, which incorporates user states and state transitions into decisions. In cascading RL, we need to select items that not only have large attraction probabilities but also lead to good successor states. This imposes a huge computational challenge due to the combinatorial action space. To tackle this challenge, we delve into the properties of value functions and design an oracle, BestPerm, to efficiently find the optimal item list. Equipped with BestPerm, we develop two algorithms, CascadingVI and CascadingBPI, which are both computationally efficient and sample-efficient, and which provide near-optimal regret and sample complexity guarantees. Furthermore, we present experiments showing that, in practice, our algorithms improve on straightforward adaptations of existing RL algorithms in both computational and sample efficiency.
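The cascade click model and the combinatorial search it induces can be made concrete with a small sketch. This is an illustration only, not the BestPerm oracle or either of the proposed algorithms: the function names, the `value` dictionary (a hypothetical stand-in for each item's immediate reward plus the value of its successor state), and the toy numbers are all assumptions. The exhaustive search enumerates every ordered list, which is exactly the cost an efficient oracle is meant to avoid.

```python
import itertools
import random


def simulate_click(item_list, attraction, rng):
    """Simulate one user pass over the list under the cascade model.

    Returns the position of the first clicked item, or None if no item
    attracts the user.
    """
    for pos, item in enumerate(item_list):
        if rng.random() < attraction[item]:
            return pos
    return None


def expected_list_value(item_list, attraction, value, no_click_value=0.0):
    """Expected value of an ordered list under the cascade model.

    value[item] is a hypothetical per-item payoff if that item is clicked
    (assumed here to bundle reward and successor-state value).
    """
    total = 0.0
    reach = 1.0  # probability that the user examines the current position
    for item in item_list:
        total += reach * attraction[item] * value[item]
        reach *= 1.0 - attraction[item]
    return total + reach * no_click_value


def brute_force_best_list(items, attraction, value, max_len, no_click_value=0.0):
    """Exhaustively search all ordered lists of length 1..max_len."""
    best_list, best_val = None, float("-inf")
    for k in range(1, max_len + 1):
        for candidate in itertools.permutations(items, k):
            val = expected_list_value(candidate, attraction, value, no_click_value)
            if val > best_val:
                best_list, best_val = candidate, val
    return best_list, best_val


if __name__ == "__main__":
    # Toy numbers (assumed): item "c" is the most attractive but has low value.
    attraction = {"a": 0.5, "b": 0.2, "c": 0.9}
    value = {"a": 1.0, "b": 3.0, "c": 0.1}
    best, val = brute_force_best_list(list(attraction), attraction, value, max_len=2)
    print(best, round(val, 6))
    print(simulate_click(best, attraction, random.Random(0)))
```

With these toy numbers the search selects the list `("b", "a")` and leaves out the most attractive item `"c"`, which illustrates why ranking by attraction probability alone is not enough once item values differ. The number of candidate lists grows rapidly with the pool size and list length, so this enumeration is only feasible for very small instances.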
