Real-world recommender systems often need to balance multiple objectives when deciding which recommendations to present to users. These include behavioural signals (e.g. clicks, shares, dwell time) as well as broader objectives (e.g. diversity, fairness). Scalarisation methods are commonly used to handle this balancing task: a weighted average of per-objective reward signals determines the final score used for ranking. Naturally, exactly how these weights are computed is key to success for any online platform. We frame this as a decision-making task, where the scalarisation weights are actions taken to maximise an overall North Star reward (e.g. long-term user retention or growth). We extend existing policy learning methods to the continuous multivariate action domain, proposing to maximise a pessimistic lower bound on the North Star reward that the learnt policy will yield. Typical lower bounds based on normal approximations suffer from insufficient coverage, and we propose an efficient and effective policy-dependent correction for this. We provide guidance on designing stochastic data collection policies, as well as highly sensitive reward signals. Empirical observations from simulations and from offline and online experiments highlight the efficacy of our deployed approach.
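As a rough illustration of the two ingredients named above, the following Python sketch shows scalarised ranking and an uncorrected normal-approximation lower bound on a mean reward. All function names, the example signals, and the quantile `z = 1.645` (one-sided 95%) are assumptions made for illustration. The policy-dependent correction proposed in the abstract is not reproduced here.

```python
import math


def scalarise(signals, weights):
    """Weighted average of per-objective reward signals for one candidate."""
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, signals)) / total


def rank(candidates, weights):
    """Return candidate ids sorted by descending scalarised score.

    `candidates` maps a hypothetical item id to its per-objective signals.
    """
    return sorted(
        candidates,
        key=lambda item: scalarise(candidates[item], weights),
        reverse=True,
    )


def normal_lower_bound(samples, z=1.645):
    """One-sided lower confidence bound on the mean of `samples`.

    Uses a plain normal approximation (mean minus z standard errors),
    i.e. the kind of bound the abstract describes as under-covering.
    Requires at least two samples.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / (n - 1)
    return mean - z * math.sqrt(var / n)


# Hypothetical signals per item: (click, share, diversity).
candidates = {"a": [0.9, 0.1, 0.2], "b": [0.4, 0.6, 0.7]}
equal_weights = [1.0, 1.0, 1.0]
click_heavy = [3.0, 1.0, 0.0]
```

Under these assumptions, changing the weight vector changes the ranking, which is why the weights can be treated as actions. A pessimistic learner would prefer the weight vector whose estimated North Star reward has the largest lower bound, not the largest mean.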