RLTP: Reinforcement Learning to Pace for Delayed Impression Modeling in Preloaded Ads

To increase brand awareness, many advertisers sign contracts with advertising platforms to purchase traffic and deliver advertisements to target audiences. Over a delivery period, advertisers usually require a certain impression count for their ads, and they also expect delivery performance to be as good as possible (e.g., a high click-through rate). Advertising platforms employ pacing algorithms to satisfy these demands by adjusting the selection probabilities of traffic requests in real time. However, the delivery procedure is also affected by publishers' strategies, which advertising platforms cannot control. Preloading is a widely used strategy for many types of ads (e.g., video ads) that keeps the response time for displaying an ad after a traffic request acceptably short, and it results in a delayed impression phenomenon. Traditional pacing algorithms rely on immediate feedback signals, so they cannot handle preloading well and may fail to guarantee advertisers' demands. In this paper, we focus on a new research problem, impression pacing for preloaded ads, and propose a Reinforcement Learning To Pace framework, RLTP. It learns a pacing agent that sequentially produces selection probabilities over the whole delivery period. To jointly optimize the two objectives of impression count and delivery performance, RLTP employs a tailored reward estimator to satisfy the guaranteed impression count, penalize over-delivery, and maximize traffic value. Experiments on large-scale industrial datasets verify that RLTP outperforms baseline pacing algorithms by a large margin. We have deployed the RLTP framework online on our advertising platform, and results show that it achieves significant uplift in core metrics, including delivery completion rate and click-through rate.
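
To make the delayed impression effect and the three reward ingredients concrete, the following is a minimal toy sketch in Python. It is not the RLTP implementation: the function names, the `delay_kernel` parameter, the `over_penalty` weight, and the additive form of the reward are all assumptions introduced here for illustration only.

```python
from typing import List, Sequence


def simulate_delayed_impressions(
    requests: Sequence[int],
    probs: Sequence[float],
    delay_kernel: Sequence[float],
) -> List[float]:
    """Expected impressions per time step when selected requests display late.

    delay_kernel[d] is the (assumed) fraction of selected requests whose ad is
    actually shown d steps after selection. Impressions that would land after
    the last step fall outside the delivery period and are not counted.
    """
    horizon = len(requests)
    impressions = [0.0] * horizon
    for t in range(horizon):
        # Expected number of requests selected at step t.
        selected = requests[t] * probs[t]
        for d, frac in enumerate(delay_kernel):
            if t + d < horizon:
                impressions[t + d] += selected * frac
    return impressions


def toy_episode_reward(
    total_impressions: float,
    target: float,
    total_value: float,
    over_penalty: float = 2.0,
) -> float:
    """Hypothetical reward: completion term, over-delivery penalty, value term."""
    # Reward progress toward the guaranteed impression count, capped at 1.
    completion = min(total_impressions / target, 1.0)
    # Penalize impressions beyond the target, relative to the target.
    over = max(total_impressions - target, 0.0) / target
    return completion - over_penalty * over + total_value
```

With `delay_kernel = [0.5, 0.5]`, half of the requests selected at a step only become impressions one step later, so a pacing rule that reads the impression count at the current step underestimates what it has already committed to. This is the kind of gap between selection and feedback that the abstract attributes to preloading.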
