We study online learning in repeated first-price auctions, where a bidder who observes only the winning bid at the end of each auction learns to bid adaptively in order to maximize her cumulative payoff. The bidder faces censored feedback: if she wins the auction, she cannot observe the highest bid of the other bidders, which we assume is drawn \textit{iid} from an unknown distribution. In this paper, we develop the first learning algorithm that achieves a near-optimal $\widetilde{O}(\sqrt{T})$ regret bound, by exploiting two structural properties of first-price auctions: the specific feedback structure and the payoff function. We first formulate the feedback structure in first-price auctions as partially ordered contextual bandits, a combination of graph feedback across actions (bids), cross learning across contexts (private values), and a partial order over the contexts. We establish both the strengths and the weaknesses of this framework by showing a curious separation: a regret nearly independent of the action/context sizes is achievable under stochastic contexts, but impossible under adversarial contexts. In particular, this framework leads to an $O(\sqrt{T}\log^{2.5}T)$ regret for first-price auctions when the bidder's private values are \textit{iid}. To overcome this limitation, we further exploit the special payoff function of first-price auctions to develop a sample-efficient algorithm that works even when the private values are adversarially generated. We establish an $O(\sqrt{T}\log^3 T)$ regret bound for this algorithm, hence providing a complete characterization of optimal learning guarantees for first-price auctions.
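As a concrete illustration of the censored feedback described above, the per-round model can be sketched as follows. The symbols $v_t$ (private value), $b_t$ (bid), and $m_t$ (highest competing bid) are notation assumed here for illustration, as is the convention that ties are resolved in the bidder's favor. In round $t$ the bidder's payoff is
\[
  r(v_t, b_t; m_t) = (v_t - b_t)\,\mathbf{1}\{b_t \ge m_t\},
\]
and the observed winning bid is $\max\{b_t, m_t\}$. If she loses ($b_t < m_t$), the winning bid reveals $m_t$ exactly; if she wins, she learns only that $m_t \le b_t$. Under this sketch, a bid $b_t$ always determines $\mathbf{1}\{b \ge m_t\}$ for every $b \ge b_t$, and hence the payoff $(v - b)\,\mathbf{1}\{b \ge m_t\}$ for every such bid $b$ and every value $v$. This is one way to read the graph feedback across bids and the cross learning across private values.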