We study a stochastic budget-allocation problem over $K$ tasks. At each round $t$, the learner chooses an allocation $X_t \in \Delta_K$. Task $k$ succeeds with probability $F_k(X_{t,k})$, where $F_1,\dots,F_K$ are nondecreasing budget-to-success curves, and upon success yields a random reward with unknown mean $\mu_k$. The learner observes which tasks succeed, and observes a task's reward only upon success (censored semi-bandit feedback). This model captures, for instance, splitting payments across crowdsourcing workers or distributing bids across simultaneous auctions, and subsumes stochastic multi-armed bandits and semi-bandits. We design an optimism-based algorithm that operates under censored semi-bandit feedback. Our main result shows that in diminishing-returns regimes, the regret of this algorithm scales polylogarithmically with the horizon $T$ without any ad hoc tuning. For general nondecreasing curves, we prove that the same algorithm (with the same tuning) achieves a worst-case regret upper bound of $\tilde O(K\sqrt{T})$. Finally, we establish a matching worst-case regret lower bound of $\Omega(K\sqrt{T})$ that holds even for full-feedback algorithms, highlighting the intrinsic hardness of our problem outside diminishing returns.
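The feedback model above can be sketched in a few lines. This is a minimal illustration, not the paper's algorithm: the curves $F_k(b) = 1 - e^{-a_k b}$ and the Bernoulli rewards below are hypothetical concrete choices (the exponential curves give one example of diminishing returns), used only to show what the learner does and does not observe each round.

```python
import math
import random

def censored_semibandit_round(x, curve_params, means, rng):
    """Simulate one round of censored semi-bandit feedback.

    x            : allocation X_t on the simplex (nonnegative, sums to 1)
    curve_params : a_k for the illustrative curves F_k(b) = 1 - exp(-a_k * b)
                   (nondecreasing and concave, i.e. diminishing returns)
    means        : unknown reward means mu_k (Bernoulli rewards, for concreteness)

    Returns one (success, reward) pair per task; reward is None when the
    task fails, since rewards are observed only upon success.
    """
    feedback = []
    for a, mu, budget in zip(curve_params, means, x):
        success = rng.random() < 1.0 - math.exp(-a * budget)  # F_k(X_{t,k})
        reward = (1.0 if rng.random() < mu else 0.0) if success else None
        feedback.append((success, reward))
    return feedback

rng = random.Random(0)
x = [0.5, 0.3, 0.2]  # an allocation over K = 3 tasks
fb = censored_semibandit_round(x, [2.0, 1.0, 4.0], [0.9, 0.5, 0.7], rng)
```

An optimism-based learner would use only these censored observations to build confidence bounds on each $F_k$ and $\mu_k$.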