RL-MPCA: A Reinforcement Learning Based Multi-Phase Computation Allocation Approach for Recommender Systems

Recommender systems aim to recommend the most suitable items to users from a large number of candidates. Their computation cost grows as the number of user requests and the complexity of services (or models) increase. With limited computation resources (CRs), trading off computation cost against business revenue becomes an essential question. Existing studies focus on dynamically allocating CRs in queue truncation scenarios (i.e., choosing the size of the candidate set), and formulate CR allocation as a constrained optimization problem. Some of them address single-phase CR allocation; others address multi-phase CR allocation but rely on assumptions specific to queue truncation scenarios. These assumptions do not hold in other scenarios, such as retrieval channel selection and prediction model selection. Moreover, existing studies ignore the state transition process of requests between different phases, which limits the effectiveness of their approaches. This paper proposes a Reinforcement Learning (RL) based Multi-Phase Computation Allocation approach (RL-MPCA), which aims to maximize the total business revenue under CR constraints. RL-MPCA formulates the CR allocation problem as a Weakly Coupled Markov Decision Process (MDP) and solves it with an RL-based approach. Specifically, RL-MPCA designs a novel deep Q-network that adapts to various CR allocation scenarios, and calibrates the Q-value with multiple adaptive Lagrange multipliers (adaptive-$\lambda$) to avoid violating the global CR constraints. Finally, experiments in an offline simulation environment and on an online real-world recommender system validate the effectiveness of our approach.
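
To make the idea of calibrating Q-values with Lagrange multipliers concrete, the following is a minimal sketch and not the RL-MPCA implementation. It assumes one non-negative multiplier per global CR constraint, a calibrated value of the form Q_reward(a) minus the multiplier-weighted cost of action a, and a projected subgradient rule for adapting the multipliers. The function names, array shapes, and step size are all hypothetical and chosen only for illustration.

```python
import numpy as np


def calibrated_action(q_reward, costs, lam):
    """Pick the action maximising Q_reward(a) - sum_k lam[k] * costs[a, k].

    q_reward: shape (n_actions,), estimated revenue Q-value per action.
    costs:    shape (n_actions, n_constraints), estimated CR cost per action.
    lam:      shape (n_constraints,), non-negative Lagrange multipliers.
    (Illustrative assumption: the paper's exact calibration may differ.)
    """
    calibrated = q_reward - costs @ lam
    return int(np.argmax(calibrated))


def update_lambda(lam, used, budget, step=0.1):
    """Projected subgradient step on the multipliers (assumed update rule).

    Raises lam[k] when constraint k is over budget, lowers it otherwise,
    and clips at zero so the multipliers stay non-negative.
    """
    return np.maximum(0.0, lam + step * (used - budget))


# Three hypothetical actions with increasing revenue and increasing CR cost.
q_reward = np.array([1.0, 2.0, 3.0])
costs = np.array([[0.0], [1.0], [3.0]])

unconstrained = calibrated_action(q_reward, costs, np.array([0.0]))
penalised = calibrated_action(q_reward, costs, np.array([0.8]))
```

With a zero multiplier the sketch selects the highest-revenue action; with a positive multiplier the costly action is penalised and a cheaper one is selected. This is the mechanism by which a price on computation can steer per-request decisions toward a global budget.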
