We study the stochastic budgeted multi-armed bandit (MAB) problem, in which a player chooses among $K$ arms with unknown expected rewards and costs. The goal is to maximize the total reward under a budget constraint, so the player seeks to choose the arm with the highest reward-cost ratio as often as possible. We illustrate several issues with current state-of-the-art policies for this problem. To overcome them, we propose a new upper confidence bound (UCB) sampling policy, $\omega$-UCB, that uses asymmetric confidence intervals. These intervals scale with the distance between the sample mean and the bounds of the random variable, yielding a more accurate and tighter estimate of the reward-cost ratio than competing policies. We show that our approach has logarithmic regret and consistently outperforms existing policies in synthetic and real-world settings.
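To make the idea of an asymmetric interval concrete, the following is a minimal sketch of a ratio-based UCB policy for budgeted bandits. It is not the $\omega$-UCB formula itself: as a stand-in it assumes rewards and costs in $[0, 1]$ and uses the classical Wilson score interval, which also narrows on the side where the sample mean is close to a bound. The names `wilson_bounds`, `ratio_ucb`, and `run_policy`, the Bernoulli arms, and the confidence schedule `z = sqrt(2 ln(t + 1))` are all illustrative assumptions.

```python
import math
import random


def wilson_bounds(mean, n, z):
    """Wilson score interval for a [0, 1]-valued sample mean.

    The interval is asymmetric around `mean`: it is narrower on the
    side where `mean` lies close to 0 or 1. Stand-in for the paper's
    interval, not its exact formula.
    """
    if n == 0:
        return 0.0, 1.0  # no data yet: the trivial interval
    denom = 1.0 + z * z / n
    center = (mean + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(mean * (1 - mean) / n + z * z / (4 * n * n))
    return max(0.0, center - half), min(1.0, center + half)


def ratio_ucb(reward_mean, cost_mean, n, z):
    """Optimistic reward-cost ratio: upper reward bound / lower cost bound."""
    _, reward_hi = wilson_bounds(reward_mean, n, z)
    cost_lo, _ = wilson_bounds(cost_mean, n, z)
    return reward_hi / max(cost_lo, 1e-12)  # guard against division by zero


def run_policy(reward_probs, cost_probs, budget, seed=0):
    """Simulate Bernoulli arms until the budget is spent.

    Assumes every entry of `cost_probs` is positive, so the loop ends.
    Returns (pulls per arm, total reward collected).
    """
    rng = random.Random(seed)
    k = len(reward_probs)
    pulls = [0] * k
    reward_sum = [0.0] * k
    cost_sum = [0.0] * k
    total_reward = 0.0
    t = 0
    while budget > 0:
        t += 1
        z = math.sqrt(2.0 * math.log(t + 1))  # assumed confidence schedule
        scores = [
            ratio_ucb(
                reward_sum[i] / pulls[i] if pulls[i] else 0.0,
                cost_sum[i] / pulls[i] if pulls[i] else 0.0,
                pulls[i],
                z,
            )
            for i in range(k)
        ]
        arm = max(range(k), key=scores.__getitem__)
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        cost = 1.0 if rng.random() < cost_probs[arm] else 0.0
        pulls[arm] += 1
        reward_sum[arm] += reward
        cost_sum[arm] += cost
        total_reward += reward
        budget -= cost
    return pulls, total_reward


if __name__ == "__main__":
    # Arm 0 has the higher reward-cost ratio (0.9 / 0.3 vs. 0.2 / 0.8).
    pulls, total = run_policy([0.9, 0.2], [0.3, 0.8], budget=200)
    print("pulls per arm:", pulls, "total reward:", total)
```

Dividing the upper reward bound by the lower cost bound gives an optimistic estimate of the ratio, so under-sampled arms keep being explored; an unpulled arm gets the trivial interval and is therefore tried first.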