North star metrics and online experimentation play a central role in how technology companies improve their products. In many practical settings, however, evaluating experiments directly on the north star metric is difficult. The two most significant issues are 1) the low sensitivity of the north star metric and 2) differences between the short-term and long-term impact on the north star metric. A common solution is to rely on proxy metrics rather than the north star for experiment evaluation and launch decisions. The existing literature on proxy metrics concentrates mainly on estimating the long-term impact from short-term experimental data. In this paper, we instead focus on the trade-off between estimating the long-term impact and sensitivity in the short term. In particular, we propose the Pareto optimal proxy metrics method, which simultaneously optimizes prediction accuracy and sensitivity. In addition, we give an efficient multi-objective optimization algorithm that outperforms standard methods. We applied our methodology to experiments from a large industrial recommendation system and found proxy metrics that are eight times more sensitive than the north star and that consistently move in the same direction, increasing both the velocity and the quality of decisions to launch new features.