Online evaluation of ranking and retrieval systems often relies on downstream monetization metrics such as app revenue or creator earnings. These metrics are typically heavy-tailed: a small fraction of users dominates both the mean and the variance, which leads to low statistical power and unreliable conclusions in A/B experiments, especially under limited traffic. We present a practical framework for variance reduction in online experiments that combines post-stratification with CUPED. Our approach leverages pre-experiment covariates to improve the sensitivity of monetization experiments without requiring additional traffic. Deployed at ShareChat across ranking-driven monetization experiments, the method substantially reduces variance and improves decision stability, achieving equivalent statistical confidence with ~45\% less traffic than standard metrics. We further discuss practical design choices, guardrails, and limitations, providing guidance on when post-stratification is appropriate for real-world information retrieval and recommendation systems.
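To make the combination concrete, the following is a minimal sketch of the textbook forms of the two techniques, not the estimator deployed at ShareChat, whose details the abstract does not give. It assumes a single pre-experiment covariate per user (for example, pre-period revenue) and strata defined by quartiles of that covariate. The function names `cuped_adjust` and `post_stratified_effect`, the quartile strata, and the synthetic log-normal data are all illustrative assumptions.

```python
import random
from collections import defaultdict
from statistics import fmean, variance


def cuped_adjust(y, x):
    """CUPED: return y - theta * (x - mean(x)), with theta = cov(x, y) / var(x).

    theta is estimated on both arms pooled, so the adjustment is the same for
    treatment and control and leaves the overall mean of y unchanged.
    """
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    theta = sxy / sxx if sxx > 0 else 0.0
    return [yi - theta * (xi - mx) for xi, yi in zip(x, y)]


def post_stratified_effect(y, treated, strata):
    """Post-stratified difference in means and its estimated variance.

    Per-stratum treatment-minus-control differences are combined with weights
    equal to each stratum's share of all users in the experiment.
    """
    groups = defaultdict(lambda: ([], []))  # stratum -> (control, treatment)
    for yi, ti, si in zip(y, treated, strata):
        groups[si][1 if ti else 0].append(yi)
    n = len(y)
    effect = var = 0.0
    for control, treatment in groups.values():
        if len(control) < 2 or len(treatment) < 2:
            raise ValueError("each stratum needs at least two users per arm")
        w = (len(control) + len(treatment)) / n
        effect += w * (fmean(treatment) - fmean(control))
        var += w ** 2 * (variance(treatment) / len(treatment)
                         + variance(control) / len(control))
    return effect, var


# Synthetic heavy-tailed data (an assumption for illustration, not real data).
rng = random.Random(0)
n = 4000
x = [rng.lognormvariate(0.0, 1.5) for _ in range(n)]   # pre-period metric
treated = [rng.random() < 0.5 for _ in range(n)]       # random assignment
y = [0.8 * xi + rng.lognormvariate(0.0, 1.0) + (0.1 if ti else 0.0)
     for xi, ti in zip(x, treated)]

# Strata 0..3 from quartiles of the pre-period metric.
ordered = sorted(x)
cuts = (ordered[n // 4], ordered[n // 2], ordered[3 * n // 4])
strata = [sum(xi > c for c in cuts) for xi in x]

# A single stratum reduces to the plain difference in means.
raw_effect, raw_var = post_stratified_effect(y, treated, [0] * n)
adj_effect, adj_var = post_stratified_effect(cuped_adjust(y, x), treated, strata)
print(f"variance ratio (adjusted / raw): {adj_var / raw_var:.2f}")
```

Because required sample size scales roughly with the variance of the effect estimate, the printed ratio gives a rough sense of how much traffic the adjusted metric would need relative to the raw one on this synthetic data. Both the covariate and the strata must be built only from pre-experiment information so that treatment cannot influence them.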