RLTP: Reinforcement Learning to Pace for Delayed Impression Modeling in Preloaded Ads

To increase brand awareness, many advertisers sign contracts with advertising platforms to purchase traffic and then deliver advertisements to target audiences. Over a whole delivery period, advertisers usually require a certain impression count for their ads, and they also expect the delivery performance to be as good as possible (e.g., a high click-through rate). Advertising platforms employ pacing algorithms to satisfy these demands by adjusting the selection probabilities of traffic requests in real time. However, the delivery procedure is also affected by publishers' strategies, which advertising platforms cannot control. Preloading is a widely used strategy for many types of ads (e.g., video ads) to ensure that the response time for displaying an ad after a traffic request is acceptable, and it results in the delayed impression phenomenon. Traditional pacing algorithms cannot handle preloading well because they rely on immediate feedback signals, and they may therefore fail to meet advertisers' demands. In this paper, we focus on a new research problem, impression pacing for preloaded ads, and propose RLTP, a Reinforcement Learning To Pace framework. It learns a pacing agent that sequentially produces selection probabilities over the whole delivery period. To jointly optimize the two objectives of impression count and delivery performance, RLTP employs a tailored reward estimator to satisfy the guaranteed impression count, penalize over-delivery, and maximize the traffic value. Experiments on large-scale industrial datasets verify that RLTP outperforms baseline pacing algorithms by a large margin. We have deployed the RLTP framework online on our advertising platform, and results show that it achieves a significant uplift in core metrics, including delivery completion rate and click-through rate.
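To make the delayed impression setting concrete, the following is a minimal toy sketch and not the RLTP implementation. It assumes a fixed delay (in pacing steps) between selecting a request and observing its impression, and a hand-written episode reward that combines the three goals named above. The function names, the fixed-delay model, and the weights `over_penalty` and `value_weight` are all hypothetical choices made for illustration.

```python
from collections import deque


def simulate_delivery(probs, requests_per_step, delay_steps):
    """Toy model: a request selected at step t becomes an impression at t + delay_steps.

    Returns the impressions observed at each step and the expected
    impressions still in flight when the loop ends.
    """
    pending = deque([0.0] * delay_steps)  # selections waiting to be displayed
    observed = []
    for p in probs:
        pending.append(p * requests_per_step)  # expected selections made now
        observed.append(pending.popleft())     # impressions that land now
    return observed, sum(pending)


def episode_reward(total_impressions, target, total_value,
                   over_penalty=2.0, value_weight=0.1):
    """Hypothetical reward: completion, minus over-delivery, plus traffic value."""
    completion = min(total_impressions / target, 1.0)
    over_delivery = max(total_impressions - target, 0.0) / target
    return completion - over_penalty * over_delivery + value_weight * total_value


observed, in_flight = simulate_delivery([0.5] * 4, requests_per_step=100, delay_steps=2)
print(observed, in_flight)
```

In this toy run, a pacer that looks only at `observed` sees no impressions during the first two steps even though selections have already been made, which illustrates why reliance on immediate feedback can lead to over-delivery once the in-flight impressions arrive.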
