Optimizing multiple objectives simultaneously is an important task for recommendation platforms that seek to improve performance on several fronts. The task is particularly challenging because the relationships between objectives are heterogeneous across consumers and fluctuate dynamically with context. In particular, when objectives conflict with one another, the recommendation outcomes form a Pareto frontier, where an improvement in any one objective comes at the cost of decreased performance in another. Unfortunately, existing multi-objective recommender systems do not systematically consider such relationships; instead, they balance the objectives in a static and uniform manner, resulting in performance significantly below Pareto optimality. In this paper, we propose a Deep Pareto Reinforcement Learning (DeepPRL) approach, in which we (1) comprehensively model the complex relationships between multiple objectives in recommendations; (2) effectively capture personalized and contextual consumer preferences toward each objective and update the recommendations accordingly; and (3) optimize both the short-term and the long-term performance of multi-objective recommendations. As a result, our method achieves significant Pareto dominance over state-of-the-art baselines in extensive offline experiments conducted on three real-world datasets. Furthermore, we conduct a large-scale online controlled experiment on Alibaba's video streaming platform, where our method simultaneously improves the three conflicting objectives of Click-Through Rate, Video View, and Dwell Time by 2%, 5%, and 7%, respectively, over the latest production system, demonstrating its tangible economic impact in industrial applications.
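To make the notions of Pareto dominance and the Pareto frontier concrete, the following is a minimal sketch, not the DeepPRL method itself. It assumes each candidate recommendation policy is summarized by a tuple of objective values where higher is better; the helper names `dominates` and `pareto_front` and the numeric values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (higher is better).

    `a` dominates `b` when it is at least as good on every objective
    and strictly better on at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (click-through rate, video views, dwell time) outcomes
# for four candidate policies; the numbers are illustrative only.
candidates = [
    (0.10, 3.0, 40.0),
    (0.12, 2.5, 42.0),
    (0.09, 2.8, 39.0),  # worse than the first candidate on every objective
    (0.12, 2.5, 41.0),  # ties the second on two objectives, worse on the third
]

front = pareto_front(candidates)
```

In this sketch the first two candidates remain on the frontier because each beats the other on at least one objective, which is the trade-off the abstract describes: moving between frontier points improves one objective only by giving up another.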