To increase brand awareness, many advertisers sign contracts with advertising platforms to purchase traffic and then deliver advertisements to target audiences. Over a whole delivery period, advertisers usually desire a certain impression count for their ads, and they also expect the delivery performance to be as good as possible (e.g., obtaining a high click-through rate). Advertising platforms employ pacing algorithms to satisfy these demands by adjusting the selection probabilities of traffic requests in real time. However, the delivery procedure is also affected by strategies from publishers, which cannot be controlled by advertising platforms. Preloading is a widely used strategy for many types of ads (e.g., video ads) to ensure that the response time for display after a traffic request stays within acceptable limits, which results in a delayed impression phenomenon. Traditional pacing algorithms cannot handle this preloading nature well because they rely on immediate feedback signals, and may fail to guarantee advertisers' demands. In this paper, we focus on a new research problem of impression pacing for preloaded ads, and propose a Reinforcement Learning To Pace framework, RLTP. It learns a pacing agent that sequentially produces selection probabilities over the whole delivery period. To jointly optimize the two objectives of impression count and delivery performance, RLTP employs a tailored reward estimator to satisfy the guaranteed impression count, penalize over-delivery, and maximize traffic value. Experiments on large-scale industrial datasets verify that RLTP outperforms baseline pacing algorithms by a large margin. We have deployed the RLTP framework online on our advertising platform, and results show that it achieves significant uplift to core metrics including delivery completion rate and click-through rate.
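The abstract describes a reward estimator that balances three signals: progress toward the guaranteed impression count, a penalty for over-delivery, and the value of selected traffic. The following is a minimal illustrative sketch of such a reward shape; all function and parameter names (`pacing_reward`, `over_penalty`, `value_weight`) are hypothetical and not taken from the paper, which does not specify its exact formulation here.

```python
def pacing_reward(delivered, target, traffic_value,
                  over_penalty=2.0, value_weight=1.0):
    """Toy reward combining the three objectives sketched in the abstract.

    delivered:     impressions delivered so far in the period
    target:        guaranteed impression count from the contract
    traffic_value: aggregate predicted value (e.g., pCTR) of selected traffic
    """
    # Reward progress toward the guaranteed impression count,
    # capped at the target so completion itself is bounded.
    completion = min(delivered, target) / target
    # Penalize any impressions delivered beyond the target (over-delivery).
    overshoot = max(delivered - target, 0) / target
    # Encourage selecting high-value traffic within the budget.
    return completion - over_penalty * overshoot + value_weight * traffic_value
```

In an actual RL setup this scalar would be computed per step (or per episode) from delayed impression feedback and used to train the pacing agent that emits selection probabilities.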