We consider a continuous-time multi-arm bandit problem (CTMAB), where the learner can sample arms any number of times in a given interval and obtains a random reward from each sample; however, increasing the sampling frequency incurs an additive penalty/cost. Thus, there is a tradeoff between obtaining a large reward and incurring sampling cost, both of which are functions of the sampling frequency. The goal is to design a learning algorithm that minimizes regret, defined as the difference between the payoff of the oracle policy and that of the learning algorithm. CTMAB is fundamentally different from the usual multi-arm bandit problem (MAB): for example, even the single-arm case is non-trivial in CTMAB, since the optimal sampling frequency depends on the mean of the arm, which needs to be estimated. We first establish lower bounds on the regret achievable with any algorithm and then propose algorithms that achieve the lower bound up to logarithmic factors. For the single-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2/\mu)$, where $\mu$ is the mean of the arm and $T$ is the time horizon. For the multiple-arm case, we show that the lower bound on the regret is $\Omega((\log T)^2 \mu/\Delta^2)$, where $\mu$ now represents the mean of the best arm and $\Delta$ is the difference between the means of the best and the second-best arms. We then propose an algorithm that achieves the bound up to constant terms.
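To make the single-arm tradeoff concrete, the following is a minimal sketch under an assumed cost model that the abstract does not specify: sampling at frequency `f` earns expected reward `mu * f` per unit time and pays a quadratic cost `c * f**2`. The quadratic form, the constant `c`, and all function names are hypothetical choices for illustration only. Under this assumption the oracle frequency is `mu / (2 * c)`, which shows why a learner that does not know `mu` must estimate it even with one arm.

```python
def payoff_rate(f, mu, c=1.0):
    """Expected payoff per unit time at sampling frequency f.

    Assumed model (not from the source): reward mu * f minus cost c * f**2.
    """
    return mu * f - c * f ** 2


def optimal_frequency(mu, c=1.0):
    """Maximizer of payoff_rate over f under the assumed quadratic cost."""
    return mu / (2.0 * c)


def regret_rate(f, mu, c=1.0):
    """Per-unit-time gap between the oracle payoff and the payoff at f."""
    return payoff_rate(optimal_frequency(mu, c), mu, c) - payoff_rate(f, mu, c)


if __name__ == "__main__":
    mu_true = 0.8   # true arm mean, unknown to the learner
    mu_hat = 1.0    # hypothetical noisy estimate of the mean
    f_used = optimal_frequency(mu_hat)
    print("oracle frequency:", optimal_frequency(mu_true))
    print("frequency from estimate:", f_used)
    print("regret rate:", regret_rate(f_used, mu_true))
```

In this toy model the regret rate equals `c * (f - f_opt)**2`, so an error in the estimated mean translates directly into a loss that accumulates over the horizon.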