We consider a decision maker who, in each period, allocates one unit of a renewable and divisible resource across a number of arms. The arms yield random rewards with unknown means: the mean reward is proportional to the allocated resource, and the noise scales with the $b$-th power of the allocated resource. Specifically, if the decision maker allocates resource $A_i$ to arm $i$ in a period, then the reward is $Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$, where $\mu_i$ is the unknown mean and the noise $\xi_{i}$ is independent and sub-Gaussian. As the order $b$ ranges from 0 to 1, the framework smoothly bridges the standard stochastic multi-armed bandit and online learning with full feedback. We design two algorithms that attain the optimal gap-dependent and gap-independent regret bounds for $b\in [0,1]$, and demonstrate a phase transition at $b=1/2$. The theoretical results hinge on a novel concentration inequality that bounds a linear combination of sub-Gaussian random variables whose weights are fractional, adapted to the filtration, and monotonic.
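To make the reward model concrete, the following is a minimal simulation sketch of one period of $Y_i(A_i)=A_i \mu_i+A_i^b \xi_{i}$. It is not one of the two algorithms from the paper. The function name `sample_rewards`, the parameter `sigma`, and the choice of Gaussian noise (one particular sub-Gaussian distribution) are assumptions made for illustration only.

```python
import numpy as np


def sample_rewards(allocation, mu, b, rng, sigma=1.0):
    """Draw one period of rewards Y_i = A_i * mu_i + A_i**b * xi_i.

    Hypothetical helper: Gaussian noise with scale `sigma` stands in for
    the generic sub-Gaussian noise xi_i of the model.
    """
    allocation = np.asarray(allocation, dtype=float)
    mu = np.asarray(mu, dtype=float)
    # One unit of resource is split across the arms in each period.
    if np.any(allocation < 0) or not np.isclose(allocation.sum(), 1.0):
        raise ValueError("allocation must be nonnegative and sum to one")
    xi = rng.normal(0.0, sigma, size=mu.shape)  # independent noise per arm
    return allocation * mu + allocation**b * xi


# Example: three arms, order b = 1/2 (means and allocation are made up).
rng = np.random.default_rng(0)
rewards = sample_rewards([0.5, 0.3, 0.2], [1.0, 0.5, 0.2], b=0.5, rng=rng)
```

Dividing a reward by its allocation gives $Y_i/A_i = \mu_i + A_i^{b-1}\xi_i$ for $A_i>0$, so under this model the per-arm estimate of $\mu_i$ has noise that scales as $A_i^{b-1}$.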