Two-Stage Constrained Actor-Critic for Short Video Recommendation

The wide popularity of short videos on social media poses new opportunities and challenges for optimizing recommender systems on video-sharing platforms. Users interact with the system sequentially and provide complex, multi-faceted responses, including watch time and various types of interactions with multiple videos. On the one hand, the platform aims to optimize users' cumulative watch time (the main goal) in the long term, which can be effectively optimized by Reinforcement Learning. On the other hand, the platform also needs to satisfy constraints on the responses from multiple types of user interactions (the auxiliary goals), such as like, follow, and share. In this paper, we formulate the problem of short video recommendation as a Constrained Markov Decision Process (CMDP). We find that traditional constrained reinforcement learning algorithms do not work well in this setting. We propose a novel two-stage constrained actor-critic method. At stage one, we learn individual policies to optimize each auxiliary signal. At stage two, we learn a policy to (i) optimize the main signal and (ii) stay close to the policies learned at the first stage, which effectively guarantees the performance of this main policy on the auxiliary signals. Through extensive offline evaluations, we demonstrate the effectiveness of our method over alternatives in both optimizing the main goal and balancing the others. We further show the advantage of our method in live experiments on short video recommendation, where it significantly outperforms other baselines in terms of both watch time and interactions. Our approach has been fully launched in the production system to optimize user experiences on the platform.
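
To make the stage-two idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a single state with a small discrete action set, and it assumes that "staying close" to the stage-one policies is measured by a weighted KL penalty, so that the stage-two policy maximizes E_pi[A] - sum_i w_i * KL(pi || pi_i). Under those assumptions the maximizer has a closed form: a weighted geometric mean of the auxiliary policies, tilted by the exponentiated main-goal advantage. The function name `stage_two_policy`, the weights, and the toy numbers are all hypothetical.

```python
import numpy as np


def stage_two_policy(main_advantage, aux_policies, weights):
    """Maximizer of E_pi[A] - sum_i w_i * KL(pi || pi_i) over discrete actions.

    Assumption (not stated in the abstract): closeness to the stage-one
    policies is enforced with a KL penalty weighted by `weights`.
    """
    advantage = np.asarray(main_advantage, dtype=float)       # shape (n_actions,)
    # Clip to avoid log(0) when an auxiliary policy gives an action zero mass.
    aux = np.clip(np.asarray(aux_policies, dtype=float), 1e-12, None)  # (n_aux, n_actions)
    w = np.asarray(weights, dtype=float)                      # shape (n_aux,)
    if np.any(w <= 0):
        raise ValueError("weights must be strictly positive")

    total = w.sum()
    # log pi(a) = (A(a) + sum_i w_i * log pi_i(a)) / sum_i w_i + const
    logits = (advantage + w @ np.log(aux)) / total
    logits -= logits.max()                                    # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Toy example: two hypothetical auxiliary policies (e.g. "like" and "follow")
# prefer action 0, while the main-goal advantage prefers action 2.
aux = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]
adv = [0.0, 1.0, 2.0]
loose = stage_two_policy(adv, aux, [0.1, 0.1])    # weak constraint
tight = stage_two_policy(adv, aux, [10.0, 10.0])  # strong constraint
```

In this sketch, small weights let the main-goal advantage dominate, while large weights pull the result toward the auxiliary policies. A full system would learn state-dependent policies and critics rather than use fixed tables.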
