In several applications of the stochastic multi-armed bandit problem, the traditional objective of maximizing the expected total reward can be inappropriate. Motivated by operational concerns in online platforms, we consider a new objective in the classical setup. Given $K$ arms, instead of maximizing the expected total reward from $T$ pulls (the traditional "sum" objective), we consider the vector of total rewards earned from each of the $K$ arms at the end of $T$ pulls and aim to maximize the expected highest total reward across arms (the "max" objective). For this objective, we show that any policy must incur an instance-dependent asymptotic regret of $\Omega(\log T)$ (with a higher instance-dependent constant than under the traditional objective) and a worst-case regret of $\Omega(K^{1/3}T^{2/3})$. We then design an adaptive explore-then-commit policy whose exploration is based on appropriately tuned confidence bounds on the mean reward and whose stopping criterion adapts to the problem difficulty; this policy achieves both bounds up to logarithmic factors. We further generalize our algorithmic insights to the problem of maximizing the expected average total reward of the $m$ arms with the highest total rewards. Our numerical experiments demonstrate the efficacy of our policies compared to several natural alternatives in practical parameter regimes. Finally, we discuss applications of these new objectives to the problem of grooming an adequate supply of value-providing market participants (workers/sellers/service providers) in online platforms.
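To make the difference between the objectives concrete, the following minimal Python sketch evaluates the "sum", "max", and top-$m$ quantities on a single realized sequence of pulls. It is only an illustration under stated assumptions: the function names (`arm_totals`, `sum_objective`, `max_objective`, `top_m_objective`) and the representation of a run as a list of `(arm, reward)` pairs are hypothetical and not taken from the paper, and the objectives in the abstract are expectations of these quantities over the reward randomness and the policy, not single-run values.

```python
def arm_totals(pulls, num_arms):
    """Return the vector of total rewards earned from each arm.

    `pulls` is a list of (arm_index, reward) pairs in the order played
    (a hypothetical representation chosen for this sketch).
    """
    totals = [0.0] * num_arms
    for arm, reward in pulls:
        totals[arm] += reward
    return totals


def sum_objective(totals):
    """Traditional "sum" quantity: total reward over all arms."""
    return sum(totals)


def max_objective(totals):
    """"Max" quantity: highest total reward earned by any single arm."""
    return max(totals)


def top_m_objective(totals, m):
    """Average total reward of the m arms with the highest totals."""
    if not 1 <= m <= len(totals):
        raise ValueError("m must be between 1 and the number of arms")
    top = sorted(totals, reverse=True)[:m]
    return sum(top) / m


# Example run with K = 3 arms and T = 6 pulls.
example_pulls = [(0, 1.0), (1, 1.0), (0, 1.0), (2, 0.0), (1, 0.0), (0, 1.0)]
example_totals = arm_totals(example_pulls, num_arms=3)
print(example_totals)                      # [3.0, 1.0, 0.0]
print(sum_objective(example_totals))       # 4.0
print(max_objective(example_totals))       # 3.0
print(top_m_objective(example_totals, 2))  # 2.0
```

In this run, spreading pulls across arms does not reduce the sum, but every pull spent away from arm 0 is a pull that cannot raise the maximum. This is one way to see why exploration is costlier under the "max" objective than under the "sum" objective.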