In this work, we investigate the online learning problem of revenue maximization in ad auctions, where the seller needs to learn the click-through rate (CTR) of each ad candidate and charges the winner on a pay-per-click basis. We focus on two models of the advertisers' strategic behavior. First, we assume that the advertisers are completely myopic, i.e.~in each round they aim to maximize their utility only for the current round. In this setting, we develop an online mechanism based on upper confidence bounds that achieves a tight $O(\sqrt{T})$ regret in the worst case, and negative regret when the values are static across all auctions and there is a gap between the highest expected value (i.e.~value multiplied by CTR) and the second-highest expected value. Next, we assume that the advertisers are non-myopic and care about their long-term utility. This setting is much more complex, since an advertiser is incentivized to influence the mechanism by bidding strategically in earlier rounds. In this setting, we provide an algorithm that achieves negative regret in the static valuation setting (with a positive gap), in sharp contrast with prior work showing $O(T^{2/3})$ regret when the valuations are generated by an adversary.
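To make the idea of a UCB-based pay-per-click auction concrete, here is a minimal Python sketch. It is not the mechanism from this work: the exploration bonus, the hypothetical helpers `ucb_ctr`, `run_round`, and `simulate`, and the second-price-style per-click payment (the lowest bid that would still win under the optimistic CTR estimates) are illustrative assumptions. Bidders are also assumed to be myopic and to bid their true values.

```python
import math
import random


def ucb_ctr(clicks, shows, t, alpha=1.0):
    """Optimistic CTR estimate: empirical mean plus an exploration bonus, capped at 1."""
    if shows == 0:
        return 1.0  # never-shown ads get the most optimistic estimate
    mean = clicks / shows
    bonus = math.sqrt(alpha * math.log(t + 1) / shows)
    return min(1.0, mean + bonus)


def run_round(bids, clicks, shows, t):
    """Rank ads by bid * UCB(CTR); return (winner index, per-click price)."""
    if len(bids) < 2:
        raise ValueError("need at least two ads for a second-price-style payment")
    ucbs = [ucb_ctr(c, s, t) for c, s in zip(clicks, shows)]
    scores = [b * u for b, u in zip(bids, ucbs)]
    order = sorted(range(len(bids)), key=lambda i: scores[i], reverse=True)
    winner, runner_up = order[0], order[1]
    # Assumed payment rule: the lowest bid with which the winner would still rank first.
    # It never exceeds the winner's bid because scores[runner_up] <= scores[winner].
    price = scores[runner_up] / ucbs[winner] if ucbs[winner] > 0 else 0.0
    return winner, price


def simulate(values, ctrs, T, seed=0):
    """Run T rounds with myopic, truthful bidders (bids = values); return (revenue, shows)."""
    rng = random.Random(seed)
    n = len(values)
    clicks, shows = [0] * n, [0] * n
    revenue = 0.0
    for t in range(1, T + 1):
        winner, price = run_round(values, clicks, shows, t)
        shows[winner] += 1
        if rng.random() < ctrs[winner]:  # the click is drawn from the true, unknown CTR
            clicks[winner] += 1
            revenue += price  # pay-per-click: charged only when the ad is clicked
    return revenue, shows
```

For example, `simulate([1.0, 0.8, 0.5], [0.5, 0.3, 0.2], T=2000)` shows the ad with the highest expected value (value times CTR) in most rounds once the CTR estimates concentrate. The optimistic CTRs are computed from past rounds only, so within a single round the payment is a threshold on the winner's bid and a myopic bidder gains nothing by misreporting. A non-myopic bidder, by contrast, can bid strategically to shape which ads get shown and hence the future CTR estimates, which is the harder setting described above.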