The adversarial Bandits with Knapsacks (BwK) problem is a multi-armed bandit problem with budget constraints and adversarially chosen rewards and costs. In each round, a learner selects an action and observes the reward and cost of the selected action. The goal is to maximize the cumulative reward while satisfying the budget constraint. The classical benchmark is the best fixed distribution over actions that satisfies the budget constraint in expectation. Unlike its stochastic counterpart, in which rewards and costs are drawn from a fixed distribution (Badanidiyuru et al., 2018), the adversarial BwK problem does not admit a no-regret algorithm for every problem instance, due to the "spend-or-save" dilemma (Immorlica et al., 2022). A key question left open by existing work is whether there exists a weaker but still meaningful benchmark against which no-regret learning remains possible. In this work, we present such a benchmark, motivated both by real-world applications such as autobidding and by its underlying mathematical structure. The benchmark is based on the Earth Mover's Distance (EMD), and we show that sublinear regret is attainable against any strategy whose spending pattern is within EMD $o(T^2)$ of some sub-pacing spending pattern. As a special case, we obtain results against the "pacing over windows" benchmark, in which time is partitioned into disjoint windows of size $w$ and the benchmark strategies may choose a different distribution over actions for each window, subject to a pacing budget constraint. Against this benchmark, our algorithm achieves a regret bound of $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$. We also prove a matching lower bound, establishing the optimality of our algorithm in this important special case. Finally, we provide further evidence that the EMD condition is necessary for obtaining sublinear regret.
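For intuition on the bound $\tilde{O}(T/\sqrt{w}+\sqrt{wT})$ (a sanity check on the stated bound, not an additional claim): the first term shrinks and the second grows as the window size $w$ increases, so sublinearity requires $w = \omega(1)$ and $w = o(T)$, and by the AM-GM inequality the bound is minimized when the two terms balance,
$$\frac{T}{\sqrt{w}} + \sqrt{wT} \;\ge\; 2\sqrt{\frac{T}{\sqrt{w}}\cdot\sqrt{wT}} \;=\; 2\,T^{3/4}, \qquad \text{with equality at } w = \sqrt{T},$$
yielding regret $\tilde{O}(T^{3/4})$ at that window size.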