Cache-Aware Reinforcement Learning in Large-Scale Recommender Systems

Modern large-scale recommender systems run on computation-intensive infrastructure and typically see a large gap in traffic between peak and off-peak periods. During peak periods, the limited computational budget makes it difficult to perform real-time computation for every request. Recommendation with a cache addresses this problem: a user-wise result cache provides recommendations whenever the system cannot afford real-time computation. However, cached recommendations are usually suboptimal compared with those computed in real time, and it is challenging to decide which items to place in each user's cache. In this paper, we propose a cache-aware reinforcement learning (CARL) method that jointly optimizes recommendation by real-time computation and by the cache. We formulate the problem as a Markov decision process with user states and a cache state, where the cache state indicates whether the recommender system serves a request by real-time computation or from the cache, and is determined by the system's computational load. We perform reinforcement learning on this model to improve user engagement over multiple requests. Moreover, we show that the cache introduces a challenge called critic dependency, which degrades the performance of reinforcement learning. To tackle this challenge, we propose an eigenfunction learning (EL) method to learn independent critics for CARL. Experiments show that CARL significantly improves user engagement when the result cache is taken into account. CARL has been fully launched in the Kwai app, serving over 100 million users.
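The interaction between computational load, the cache state, and the user-wise result cache can be illustrated with a small sketch. It is not the CARL implementation: the class `CacheAwareEnv`, the `capacity` threshold, the rule that a real-time request shows the top `k` items and caches the rest, and the fallback to real-time computation when a user's cache is empty are all assumptions made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CacheAwareEnv:
    """Toy serving loop with a load-driven cache state (hypothetical)."""
    capacity: int = 100  # assumed compute budget; load above it forces cache use
    cache: Dict[str, List[str]] = field(default_factory=dict)  # user-wise result cache

    def cache_state(self, load: int) -> int:
        # 1 = serve from the cache, 0 = real-time computation.
        # The computational load alone determines this state.
        return 1 if load > self.capacity else 0

    def serve(self, user_id: str, load: int,
              policy: Callable[[str], List[str]], k: int = 2) -> Tuple[int, List[str]]:
        """Return (mode actually used, items shown) for one request."""
        if self.cache_state(load) == 1 and self.cache.get(user_id):
            # Peak traffic: pop the next k items from this user's cache.
            items = self.cache[user_id][:k]
            self.cache[user_id] = self.cache[user_id][k:]
            return 1, items
        # Off-peak (or empty cache, an assumed fallback): rank in real time,
        # show the top k items and store the remainder for later requests.
        ranked = policy(user_id)
        self.cache[user_id] = ranked[k:]
        return 0, ranked[:k]


def toy_policy(user_id: str) -> List[str]:
    # Stand-in for a learned policy: a fixed ranking of six items.
    return [f"item{i}" for i in range(6)]


env = CacheAwareEnv(capacity=100)
first = env.serve("u1", load=50, policy=toy_policy)    # off-peak, real time
second = env.serve("u1", load=150, policy=toy_policy)  # peak, from the cache
third = env.serve("u2", load=150, policy=toy_policy)   # peak, but cache is empty
```

In this sketch the items chosen at a real-time request also fix what the user sees at later cached requests, which is why a policy that looks only at the current request can be suboptimal over multiple requests.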
