Optimal No-regret Learning in Repeated First-price Auctions

We study online learning in repeated first-price auctions where a bidder, observing only the winning bid at the end of each auction, learns to bid adaptively in order to maximize her cumulative payoff. In doing so, the bidder faces censored feedback: if she wins the auction, she cannot observe the highest bid of the other bidders, which we assume is drawn \emph{iid} from an unknown distribution. In this paper, we develop the first learning algorithm that achieves a near-optimal $\widetilde{O}(\sqrt{T})$ regret bound, by exploiting two structural properties of first-price auctions, namely the specific feedback structure and the payoff function. The feedback in first-price auctions combines graph feedback across actions (bids), cross learning across contexts (private values), and a partial order over the contexts; we generalize this structure as partially ordered contextual bandits. We establish both the strengths and the weaknesses of this framework by showing a curious separation: a regret nearly independent of the action/context sizes is possible under stochastic contexts, but is impossible under adversarial contexts. In particular, this framework leads to an $O(\sqrt{T}\log^{2.5}T)$ regret for first-price auctions when the bidder's private values are \emph{iid}. Despite the limitation of the above framework, we further exploit the special payoff function of first-price auctions to develop a sample-efficient algorithm even in the presence of adversarially generated private values. We establish an $O(\sqrt{T}\log^3 T)$ regret bound for this algorithm, hence providing a complete characterization of optimal learning guarantees for first-price auctions.
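To make the censored feedback concrete, the following is a minimal sketch of a single round as seen by the bidder. It is an illustration only and not the paper's algorithm. The function names `play_round` and `simulate`, the tie-breaking rule in the bidder's favor, the Uniform[0, 1] distribution for the other bidders' highest bid, and the placeholder bidding rule `bid = 0.5 * value` are all assumptions introduced here.

```python
import random


def play_round(value, bid, highest_other_bid):
    """Simulate one first-price auction round from the bidder's viewpoint.

    Assumption (not stated in the abstract): ties go to the bidder.
    Returns (payoff, winning_bid, observed_other_bid).
    """
    won = bid >= highest_other_bid
    # First-price payoff: the winner pays her own bid.
    payoff = (value - bid) if won else 0.0
    # Only the winning bid is revealed at the end of the auction.
    winning_bid = bid if won else highest_other_bid
    # Censoring: after a win the bidder sees only her own bid, so she
    # learns that highest_other_bid <= bid but not its exact value.
    observed_other_bid = None if won else highest_other_bid
    return payoff, winning_bid, observed_other_bid


def simulate(num_rounds=5, seed=0):
    """Run a few rounds with hypothetical iid Uniform[0, 1] competing bids."""
    rng = random.Random(seed)
    history = []
    for _ in range(num_rounds):
        value = rng.random()              # private value (context)
        bid = 0.5 * value                 # placeholder rule, not a learned policy
        highest_other_bid = rng.random()  # unknown to the bidder before bidding
        history.append(play_round(value, bid, highest_other_bid))
    return history
```

In this sketch, a win returns `None` for the competing bid, which is the censoring described above, whereas a loss reveals the competing bid exactly. A loss therefore determines the payoff that every alternative bid and every private value would have earned in that round.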
